Abstract. Graph data models have recently become popular owing to their applications,
e.g., in social networks and the semantic web. Typical navigational query languages over 
graph databases — such as Conjunctive Regular Path Queries (CRPQs) — cannot express 
relevant properties of the interaction between the underlying data and the topology. Two 
languages have been recently proposed to overcome this problem: walk logic (WL) and 
regular expressions with memory (REM). In this paper, we begin by investigating fundamental
properties of WL and REM, i.e., complexity of evaluation problems and expressive
power. We first show that the data complexity of WL is nonelementary, which rules out
its practicality. On the other hand, while REM has low data complexity, we point out that 
many natural data/topology properties of graphs expressible in WL cannot be expressed 
in REM. To this end, we propose register logic, an extension of REM, which we show to 
be able to express many natural graph properties expressible in WL, while at the same 
time preserving the elementariness of data complexity of REMs. It is also incomparable 
to WL in terms of expressive power. 


1. Introduction 

Graph databases have gained renewed interest due to applications such as the semantic
web, social network analysis, crime detection networks, software bug detection, biological
networks, and others (e.g., see [1] for a survey). Despite the importance of querying graph
databases, no general agreement has been reached to date about the kind of features a practical
query language for graph databases should support and about what can be considered
a reasonable computational cost of query evaluation for the aforementioned applications. 

Typical navigational query languages for graph databases, including conjunctive
regular path queries [7] and their many extensions [3], suffer from a common drawback:
they are well-suited for expressing relevant properties about the underlying topology of a 
graph database, i.e., about the way in which (labeled) nodes are connected via (labeled) 
edges, but not about how such topology interacts with the node ids or the data. This 
drawback is shared by common specification languages for verification [6] (e.g. CTL*), 
which are evaluated over a similar graph data model (a.k.a. transition systems). Examples 
of important queries that combine graph data and topology, but cannot be expressed in 
usual navigational languages for graph databases, include the following [8, 13]: (Q1) Find
pairs of people in a social network connected by professional links restricted to people of the
same age. (Q2) Find pairs of cities $x$ and $y$ in a transportation system, such that $y$ can
be reached from $x$ using only services operated by the same company. In each one of these
queries, the connectivity between two nodes (i.e., the topology) is constrained by the data
(from an infinite domain, e.g., $\mathbb{N}$), in the sense that we only consider paths in which all
intermediate nodes satisfy a certain condition (e.g. they are people of the same age).

Two languages, walk logic and regular expressions with memory, have recently been 
proposed to overcome this problem. These languages have different goals: 

(a) Walk logic (WL) was proposed by Hellings et al. [8] as a unifying framework for 
understanding the expressive power of path queries over graph databases. Its strength is 
on the expressiveness side. The underlying data model of WL is that of (node or edge)- 
labeled directed graphs. In this context, WL can be seen as a natural extension of FO 
with path quantification, plus the ability to check whether positions $p$ and $p'$ in paths $\pi$
and $\pi'$, respectively, have the same data values. In their paper, Hellings et al. assume the
restriction that each node carries a distinct data value (and, therefore, that this data value 
serves as an identifier for the node). However, as we shall see, this makes no difference in 
terms of the results that we can obtain. 

(b) Regular expressions with memory (REMs) were proposed by Libkin and Vrgoc [10]
as a formalism for comparing data values along a single path, while retaining a reasonable 
complexity for query evaluation. The strength of this language is on the side of efficiency. 
The data model of the class of REMs is that of edge-labeled directed graphs, in which 
each node is assigned a data value from an infinite domain. REMs define pairs of nodes 
in the graph database that are linked by a path satisfying a given condition c. Each such 
condition c is defined in a formalism inspired by the class of register automata [9], allowing 
some data values to be stored in the registers and then compared against other data values. 
The evaluation problem for REMs is PSPACE-complete (same as for FO over relational
databases), and can be solved in polynomial time in data complexity [10], i.e., assuming
queries to be fixed.¹ This shows that the language is, in fact, well-behaved in terms of the
complexity of query evaluation. 

The aim of this paper is to investigate the expressiveness and complexity of query
evaluation for WL and the class of REMs with the hope of finding a navigational query
language for data graphs that strikes a good balance between these two important aspects
of query languages.

¹ Recall that data complexity is a reasonable measure of complexity in the database scenario [13], since
queries are often much smaller than the underlying data.

Contributions. We start by considering WL, which is known to be a powerful formalism in 
terms of expressiveness. Little is known about the cost of query evaluation for this language, 
save for the decidability of the evaluation problem and NP-hardness of its data complexity. 
Our first main contribution is to pinpoint the exact complexity of the evaluation problem for
WL (thus answering an open problem from [8]): we prove that it is non-elementary, and
that this holds even in data complexity, which rules out the practicality of the language.

We thus move to the class of REMs, which suffers from the opposite drawback: Although 
the complexity of evaluation for queries in this class is reasonable, the expressiveness of 
the language is too rudimentary for expressing some important path properties due to its 
inability to (i) compare data values in different paths and (ii) express branching properties 
of the graph database. An example of an interesting query that is not expressible as an 
REM is the following: (Q) Find pairs of nodes $x$ and $y$, such that there is a node $z$ and a
path $\pi$ from $x$ to $y$ in which each node is connected to $z$. Notice that this is the query that
lies at the basis of the queries (Q1) and (Q2) we presented before.

Our second contribution then is to identify a natural extension of this language, called
register logic (RL), that closes REMs under Boolean combinations and existential quantification
over nodes, paths and register assignments. The latter allows the logic to express
comparisons of data values appearing in different paths, as well as branching properties of 
the data. This logic is incomparable in expressive power to WL. Besides, many natural 
queries relating data and topology in data graphs can be expressed in RL including: the 
query (Q), hamiltonicity, the existence of an Eulerian trail, bipartiteness, and connected 
graphs with an even number of nodes. We then study the complexity of the problem of query
evaluation for RL, and show that it can be solved in elementary time (in particular, that it
is EXPSPACE-complete). This is in contrast to WL, for which even the data complexity is
non-elementary. With respect to data complexity, we prove that RL is PSPACE-complete.
We then identify a slight extension of its existential-positive fragment, which is tractable
(NLOGSPACE) in data complexity and can express many queries of interest (including the
query (Q)). The idea behind this extension is that atomic REMs can be enriched with an 
existential branching operator - in the style of the class of nested regular expressions [5] - 
that increases expressiveness without affecting the cost of evaluation. 

Organization of the paper. Section 2 defines our data model. In Section 3 we briefly
recall the definition of walk logic and some basic results from [8]. In Section 4 we prove
that the data complexity of WL is nonelementary. Section 5 contains our results concerning
register logic. We conclude in Section 6 with future work.

2. The Data Model 

We start with a definition of our data model: data graphs. 

Definition 2.1 (Data graph). Let $\Sigma$ be a finite alphabet. A data graph $G$ over $\Sigma$ is a tuple
$(V, E, \kappa)$, where $V$ is the finite set of nodes, $E \subseteq V \times \Sigma \times V$ is the set of directed edges
labeled in $\Sigma$ (that is, each triple $(v, a, v') \in E$ is to be seen as an edge from $v$ to $v'$ in $G$
labeled $a$), and $\kappa : V \to \mathcal{D}$ is a function that assigns a data value in $\mathcal{D}$ to each node in $V$.

This is the data model adopted by Libkin and Vrgoc [10] in their definition of REMs.
In the case of WL [8], the authors adopted graph databases as their data model, i.e., data
graphs $G = (V, E, \kappa)$ such that $\kappa$ is injective (i.e. each node carries a different data value).
In such a case we can think of $\kappa(v)$ as the identifier (id) of $v$, for each $v \in V$. We shall
adopt the general model of [10] since none of our complexity results are affected by the
data model: upper bounds hold for data graphs, while all lower bounds are proved in the
more restrictive setting of graph databases. However, for the sake of the comparison with
the expressiveness of WL, many of our examples are constructed in the scenario of graph
databases, that is, when $\kappa(v)$ serves as an id for node $v$.
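
As an illustration of the two data models, the following minimal Python sketch represents a data graph and tests whether it is a graph database, i.e., whether the data-value function is injective. The class and method names (`DataGraph`, `is_well_formed`, `is_graph_database`) are our own assumptions for this example and do not come from [8] or [10].

```python
from dataclasses import dataclass


@dataclass
class DataGraph:
    nodes: set   # V: the finite set of nodes
    edges: set   # E: triples (v, a, w), an a-labeled edge from v to w
    kappa: dict  # kappa: maps each node to its data value

    def is_well_formed(self, alphabet):
        """Check that E is a subset of V x Sigma x V and kappa is total on V."""
        edges_ok = all(v in self.nodes and a in alphabet and w in self.nodes
                       for (v, a, w) in self.edges)
        return edges_ok and set(self.kappa) == self.nodes

    def is_graph_database(self):
        """True when kappa is injective, so data values can serve as node ids."""
        values = list(self.kappa.values())
        return len(values) == len(set(values))


# Nodes 1 and 2 share the data value 30, so this is a data graph
# but not a graph database.
g = DataGraph(nodes={1, 2, 3},
              edges={(1, "a", 2), (2, "b", 3)},
              kappa={1: 30, 2: 30, 3: 41})
```

Replacing the data value of node 2 by any value other than 30 and 41 would make `kappa` injective, and the same object would then be a graph database in the sense used for WL.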

There is also the issue of edge-labeled vs node-labeled data graphs. Our data model is 
edge-labeled, but the original one for WL is node-labeled [8]. We have chosen to use the 
former because it is the standard in the literature [2]. Again, this choice is inessential, since
all the complexity results we present in the paper remain true if the logics are interpreted
over node-labeled graph databases or data graphs (applying the expected modifications to 
the syntax). 

Finally, in several of our examples we use logical formulas to express properties of 
undirected graphs. In each such case we assume that an undirected graph H is represented 
as a graph database $G = (V, E, \kappa)$ over unary alphabet $\Sigma = \{a\}$, where $V$ is the set of nodes
of $H$ and $E$ is a symmetric relation (i.e. $(v, a, v') \in E$ iff $(v', a, v) \in E$). In particular, since
$G = (V, E, \kappa)$ is a graph database we have that $\kappa$ is injective, i.e., each node is uniquely
determined by its data value. 


3. Walk Logic 

WL is an elegant and powerful formalism for defining properties of paths in graph databases, 
which was originally proposed in [8] as a yardstick for measuring the expressiveness of 
different path logics. 

The syntax of WL is defined with respect to countably infinite sets $\Pi$ of path variables
(that we denote as $\pi, \pi_1, \pi_2, \dots$) and $T(\pi)$, for each $\pi \in \Pi$, of position variables of sort $\pi$.
We assume that different sorts are associated with distinct position variables. We denote
position variables by $t, t_1, t_2, \dots$, and write $t^\pi$ when we need to emphasize that position
variable $t$ is of sort $\pi$.

Definition 3.1 (Walk logic (WL)). The set of formulas of WL over finite alphabet $\Sigma$ is
defined by the following grammar, where (i) $a \in \Sigma$, (ii) $t, t_1, t_2$ are position variables of any
sort, (iii) $\pi$ is a path variable, and (iv) $t_1^\pi, t_2^\pi$ are position variables of the same sort $\pi$:

$$\phi, \phi' := E_a(t_1^\pi, t_2^\pi) \mid t_1^\pi < t_2^\pi \mid t_1 \sim t_2 \mid \neg\phi \mid \phi \lor \phi' \mid \exists t^\pi \phi \mid \exists \pi \phi$$

As usual, WL formulas without free variables are called Boolean. □

To define the semantics of WL we need to introduce some terminology. A path (a.k.a.
walk in [8]) in the data graph $G = (V, E, \kappa)$ is a finite, nonempty sequence

$$\rho = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n,$$

such that $(v_i, a_i, v_{i+1}) \in E$ for each $1 \le i < n$. The set of positions of $\rho$ is $\{1, \dots, n\}$, and $v_i$ is
the node in position $i$ of $\rho$, for $1 \le i \le n$. The intuition behind the semantics of WL formulas
is as follows. Each path variable $\pi$ is interpreted as a path $\rho = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$ in
the data graph $G$, while each position variable $t$ of sort $\pi$ is interpreted as a position
$1 \le i \le n$ in $\rho$ (that is, position variables of sort $\pi$ are interpreted as positions in the
path that interprets $\pi$). The atomic formula $E_a(t_1^\pi, t_2^\pi)$ is true iff $\pi$ is interpreted as path
$\rho = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$, the position $p_2$ that interprets $t_2$ in $\rho$ is the successor of the
position $p_1$ that interprets $t_1$ (i.e. $p_2 = p_1 + 1$), and the node in position $p_1$ is linked in $\rho$ by an
$a$-labeled edge to the node in position $p_2$ (that is, $a_{p_1} = a$). In the same way, $t_1^\pi < t_2^\pi$ holds iff
in the path $\rho$ that interprets $\pi$ the position that interprets $t_1$ is smaller than the one that
interprets $t_2$. Furthermore, $t_1 \sim t_2$ is the case iff the data value carried by the node in the
position assigned to $t_1$ is the same as the data value carried by the node in the position
assigned to $t_2$ (possibly in different paths). We formalize the semantics of WL below.

Let $G = (V, E, \kappa)$ be a data graph and $\phi$ a WL formula. Assume that $S_\phi$ is the set that
consists of (i) all position variables $t^\pi$ and path variables $\pi$ such that $t^\pi$ is a free variable of $\phi$,
and (ii) all path variables $\pi$ such that $\pi$ is a free variable of $\phi$. Intuitively, $S_\phi$ defines the set
of (both path and position) variables that are relevant to define the semantics of $\phi$ over $G$.
An assignment $\alpha$ for $\phi$ over $G$ is a mapping that associates a path $\rho = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$
in $G$ with each path variable $\pi \in S_\phi$, and a position $1 \le i \le n$ with each position variable of
the form $t^\pi$ in $S_\phi$ (notice that this is well-defined since $\pi \in S_\phi$ every time a position variable
of the form $t^\pi$ is in $S_\phi$). As usual, we denote by $\alpha[t \to i]$ and $\alpha[\pi \to \rho]$ the assignments
that are equal to $\alpha$ except that $t$ is now assigned position $i$ and $\pi$ the path $\rho$, respectively.

We say that $G$ satisfies $\phi$ under $\alpha$, denoted $(G, \alpha) \models \phi$, if one of the following holds
(we omit Boolean combinations which are standard):

• $\phi = E_a(t_1^\pi, t_2^\pi)$, path $\alpha(\pi)$ is $v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$, and it is the case that
$\alpha(t_2^\pi) = \alpha(t_1^\pi) + 1$ and $a = a_{\alpha(t_1^\pi)}$.

• $\phi = (t_1^\pi < t_2^\pi)$ and $\alpha(t_1^\pi) < \alpha(t_2^\pi)$.

• $\phi = (t_1 \sim t_2)$, $t_1$ is of sort $\pi_1$, $t_2$ is of sort $\pi_2$, and $\kappa(v_1) = \kappa(v_2)$, where $v_i$ is the node in
position $\alpha(t_i)$ of $\alpha(\pi_i)$, for $i = 1, 2$.

• $\phi = \exists t^\pi \psi$ and one of the following holds:

(1) $t^\pi$ does not appear free in $\psi$, or

(2) both $t^\pi$ and $\pi$ appear free in $\psi$, and there is a position $i$ in $\alpha(\pi)$ such that
$(G, \alpha[t^\pi \to i]) \models \psi$, or

(3) $t^\pi$ appears free in $\psi$, $\pi$ does not appear free in $\psi$, and there is a path $\rho$ in $G$ and a
position $i$ in $\rho$ such that $(G, \alpha[\pi \to \rho, t^\pi \to i]) \models \psi$.

• $\phi = \exists \pi \psi$ and the following holds:

(1) $\pi$ does not appear free in $\psi$, or

(2) there is a path $\rho$ in $G$ such that $(G, \alpha[\pi \to \rho]) \models \psi$.
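
To make the atomic cases concrete, the following minimal Python sketch evaluates the three atoms of WL under a fixed assignment. The representation is an assumption made only for this example: a path is a pair (list of nodes, list of edge labels), and each position variable is mapped to a pair (path variable, 1-based position). The function names are hypothetical and the sketch assumes that every position is valid for its path.

```python
def holds_edge(a, t1, t2, paths, pos):
    """E_a(t1, t2): t2 is the successor of t1 on the same path and the
    edge between the two positions carries label a."""
    (p1, i1), (p2, i2) = pos[t1], pos[t2]
    if p1 != p2:                      # both variables must have the same sort
        return False
    _nodes, labels = paths[p1]
    # labels[i1 - 1] is the label a_{i1} of the edge leaving position i1
    return i2 == i1 + 1 and labels[i1 - 1] == a


def holds_less(t1, t2, pos):
    """t1 < t2: same path, strictly smaller position."""
    (p1, i1), (p2, i2) = pos[t1], pos[t2]
    return p1 == p2 and i1 < i2


def holds_same_data(t1, t2, paths, pos, kappa):
    """t1 ~ t2: the nodes at the two positions carry the same data value
    (the positions may belong to different paths)."""
    (p1, i1), (p2, i2) = pos[t1], pos[t2]
    v1 = paths[p1][0][i1 - 1]
    v2 = paths[p2][0][i2 - 1]
    return kappa[v1] == kappa[v2]


# The path 1 a 2 b 3, with data values 7, 8, 7 on nodes 1, 2, 3.
paths = {"pi": ([1, 2, 3], ["a", "b"])}
pos = {"t1": ("pi", 1), "t2": ("pi", 2), "t3": ("pi", 3)}
kappa = {1: 7, 2: 8, 3: 7}
```

Quantifiers would then range over all positions of a path, or over all paths of $G$; since $G$ has infinitely many paths as soon as it contains a cycle, the sketch deliberately stops at the atoms.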

Example 3.1. A simple example from [8] that shows that WL expresses NP-complete
properties is the following query that checks if a graph $G$ has a Hamiltonian path:

$$\exists \pi \big( \forall t_1^\pi \forall t_2^\pi (t_1^\pi \neq t_2^\pi \to t_1^\pi \not\sim t_2^\pi) \land \forall \pi' \forall t_1^{\pi'} \exists t_2^\pi (t_1^{\pi'} \sim t_2^\pi) \big).$$

In fact, this query expresses that there is a path $\pi$ in $G$ that does not repeat nodes (because
$\pi$ satisfies $\forall t_1^\pi \forall t_2^\pi (t_1^\pi \neq t_2^\pi \to t_1^\pi \not\sim t_2^\pi)$), and every node belongs to such path (because $\pi$
satisfies $\forall \pi' \forall t_1^{\pi'} \exists t_2^\pi (t_1^{\pi'} \sim t_2^\pi)$, and, therefore, every node that occurs in some path $\pi'$ in the
graph database also occurs in $\pi$). Note that this formula uses in an essential way the fact
that $G$ is a graph database, i.e., that each node is uniquely identified by its data value. □
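
The property defined by this formula can be checked directly by brute force, as in the minimal Python sketch below. This is not an evaluation procedure for WL: it simply enumerates all orderings of the nodes, which corresponds to guessing a path that repeats no node and visits every node. The function name `has_hamiltonian_path` is our own, and the running time is exponential.

```python
from itertools import permutations


def has_hamiltonian_path(nodes, edges):
    """Return True iff some path visits every node exactly once.

    nodes: iterable of nodes; edges: set of triples (v, a, w).
    Edge labels are ignored, only the direction of each edge matters.
    """
    succ = {(v, w) for (v, _label, w) in edges}
    nodes = list(nodes)
    if not nodes:
        return False          # paths are nonempty sequences
    return any(
        all((order[i], order[i + 1]) in succ for i in range(len(order) - 1))
        for order in permutations(nodes)
    )


chain = {(1, "a", 2), (2, "a", 3)}   # 1 -> 2 -> 3
fork = {(1, "a", 2), (1, "a", 3)}    # 1 -> 2 and 1 -> 3
```

On `chain` the ordering 1, 2, 3 is a witness, whereas on `fork` no ordering of the three nodes follows the edges.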


4. WL Evaluation is Non-elementary in Data Complexity 

In this section we pinpoint the precise complexity of query evaluation for WL. It was proven
in [8] that this problem is decidable. Although the precise complexity of this problem was
left open in [8], one can prove that this is, in fact, a non-elementary problem by an easy
translation from the satisfiability problem for FO formulas - which is known to be non-elementary
[ElES]. In databases, however, one is often interested in a different measure
of complexity - called data complexity [13] - that assumes the formula $\phi$ to be fixed. This
is a reasonable assumption since databases are usually much bigger than formulas. Often
in the setting of data complexity the cost of evaluating queries is much smaller than in the
general setting in which formulas are part of the input. The main result of this section
is that the data complexity of evaluating WL formulas is nonelementary even over graph
databases, which rules out its practicality.

Let $\phi$ be a WL formula without free variables. The evaluation problem for $\phi$, denoted
Eval(WL, $\phi$), is defined as follows: Given a data graph $G$, is it the case that $G \models \phi$? We
prove the following:

Theorem 4.1. The evaluation problem for WL is non-elementary in data complexity. In
particular, for each $k \in \mathbb{Z}_{>0}$, there is a finite alphabet $\Sigma$ and a Boolean formula $\phi$ over $\Sigma$,
such that the problem Eval(WL, $\phi$) of evaluating the WL formula $\phi$ is $k$-EXPSPACE-hard.
In addition, the latter holds even if the input is restricted to the class of graph databases.

We prove the above result by showing that for all natural numbers $k$, the data complexity
of the model checking problem for WL is $k$-EXPSPACE-hard. For all natural numbers
$k$ and $f_0$, we provide a reduction from the class of problems solvable by a Turing machine
using a tape of size $\mathit{tower}(k, f_0 n)$ given an input word of size $n$, where $\mathit{tower}(1, n) := 2^n$
and $\mathit{tower}(k + 1, n) := 2^{\mathit{tower}(k, n)}$.
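
To give a feeling for the growth of this bound, here is a minimal Python sketch of the tower function, assuming the usual recurrence $\mathit{tower}(1, n) = 2^n$ and $\mathit{tower}(k+1, n) = 2^{\mathit{tower}(k, n)}$. The function name mirrors the notation of the text; the implementation itself is only an illustration.

```python
def tower(k, n):
    """Tower of exponentials of height k with n on top.

    tower(1, n) = 2**n and tower(k + 1, n) = 2**tower(k, n).
    """
    if k < 1:
        raise ValueError("the height k must be at least 1")
    value = n
    for _ in range(k):
        value = 2 ** value
    return value
```

Already $\mathit{tower}(3, 2) = 2^{16} = 65536$ and $\mathit{tower}(4, 2) = 2^{65536}$, which is why the configurations of the machine cannot be indexed by binary numbers of polynomial length once $k > 1$.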

More precisely, for all natural numbers $k > 0$, there is a Turing machine $M$ and a
constant $f_0$ such that the following problem is $k$-EXPSPACE-hard: given a word $w$ of size
$n$, is there an accepting run of $M$ over $w$ using at most $\mathit{tower}(k, f_0 n)$ cells? We prove that
there is a formula $\phi \in$ WL such that for all words $w$ of size $n$, there is a graph $G_w$ such that

$G_w \models \phi$ iff there is an accepting run of $M$ over $w$ using at most $\mathit{tower}(k, f_0 n)$ cells. (4.1)


Before giving a proof, we sketch the case $k = 1$ here, which illustrates the proof idea. Let $M$
be a Turing machine such that the following problem is EXPSPACE-hard: given a word
$w$ of size $n$, is there an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells? The formula $\phi$
that we will define, and that satisfies equivalence (4.1), is of the form

$$\exists \pi\, \psi(\pi),$$

where $\psi$ is a formula that does not contain any quantification over path variables. Given a
word $w$ of size $n$, the label of the path $\pi$ in the graph $G_w$ will encode an accepting run of
$M$ over the word $w$ in the following way.

Given a word $w$ of size $n$, consider a configuration $C$ of the run of $M$ over $w$ where the
head is scanning the cell number $i_0$, the machine is in state $q$ and the content of the tape
is the word $w' = w'_0 \dots w'_j$ ($j = 2^{f_0 n} - 1$). We may encode the configuration $C$ by the word
$e_C = d_0^C \cdots d_j^C$, where each $d_i^C$ encodes the information in cell number $i$ and $j = 2^{f_0 n} - 1$.
More precisely, we define $d_i^C$ as a word of the form

$c(i)\,(q'_i, w'_i)$, (4.2)

where $c(i)$ and $q'_i$ are defined as follows. The word $c(i)$ is the binary encoding of the number
$i$. The letter $w'_i$ is the content of the cell $i$. The letter $q'_i$ is equal to the dummy symbol $\$$
if the head is not scanning the cell number $i$; otherwise, $q'_i$ is equal to the state $q$. That is,
$q'_{i_0} = q$ and for all $i \neq i_0$, $q'_i = \$$. We encode a run $C_0 C_1 \dots$ as the sequence $e_{C_0} e_{C_1} \cdots$.
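
The encoding of a single configuration can be sketched in Python as follows. This is a minimal illustration under our own conventions: the parameter `width` plays the role of $f_0 n$, a cell description is returned as a pair (binary index, (state or dummy, tape symbol)), and the function name `encode_configuration` is hypothetical.

```python
def encode_configuration(tape, head, state, width, dummy="$"):
    """Return the cell descriptions d_0, ..., d_j of a configuration.

    tape  : list of tape symbols, of length 2**width
    head  : index i_0 of the cell scanned by the head
    state : current state q of the machine
    width : number of bits of each cell index (the role of f_0 * n)
    """
    if len(tape) != 2 ** width:
        raise ValueError("the tape must have exactly 2**width cells")
    cells = []
    for i, symbol in enumerate(tape):
        index_bits = format(i, "0{}b".format(width))  # c(i): binary encoding of i
        marker = state if i == head else dummy        # q'_i
        cells.append((index_bits, (marker, symbol)))
    return cells


# Four cells (width 2), head on cell 1, machine in state "q3".
example = encode_configuration(["a", "b", "_", "_"], head=1, state="q3", width=2)
```

Exactly one cell description carries a state, which is the "only one tape head" condition that a formula recognising encodings of configurations has to enforce.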
We think of a path $\pi$ encoding a run as consisting of two parts: the first part contains
the encoding $e_{C_0}$ of the initial configuration and is a path through a subgraph $I_w$ of $G_w$,
while the second part contains the encoding $e_{C_1} e_{C_2} \cdots$ and is a path through the subgraph
$H$ of $G_w$. If $Q$ is the set of states of $M$ and $\Sigma$ is the alphabet, we define $H$ as the following
graph

[Figure: the graph $H$.]

where $l$ is equal to $|(Q \cup \{\$\}) \times \Sigma|$, $\{d_i : 1 \le i \le l\} = (Q \cup \{\$\}) \times \Sigma$ and the number of
nodes with outgoing edges with labels 0 and 1 is equal to $f_0 n$. The label of a path $\pi'$ from
the “left-most” node $x$ to the “right-most” node $z$ with only one occurrence of $x$ is exactly
the description of a cell in a configuration: it is the binary encoding of a natural number
$< 2^{f_0 n}$ followed by a pair of the form $(q', a)$. We can define a formula $\phi_C \in$ WL such that
for all paths $\pi$ starting in $x$ and ending in $z$,

$H \models \phi_C(\pi)$ iff the label of $\pi$ is the encoding of a configuration.

We do not give details; $\phi_C$ has to express that the encoding of a configuration only has one
tape head, that the first number encoded in binary is 0, that the last number is $2^{f_0 n} - 1$
and that the encoding of the description of cell number $j$ is followed by the description of
cell number $j + 1$. Using the formula $\phi_C$, we can define a formula $\phi_1$ such that for all paths $\pi$,

$H \models \phi_1(\pi)$ iff the label of $\pi$ is the encoding of an accepting run.

The formula $\phi_1$ has to ensure that if $e_C e_{C'}$ occurs in the label of $\pi$, then $C$ and $C'$ are
consecutive configurations according to $M$. Moreover, $\phi_1$ has to express that eventually we
reach the final state. In order to express $\phi_C$ and $\phi_1$, we use the ability of WL to check
whether two positions correspond to the same node. For example, in order to define $\phi_1$,
since we need to compare consecutive configurations $e_C$ and $e_{C'}$, we need to be able to
compare the content of a cell in configuration $C$ and the content of that same cell in $C'$. In
particular, we want to be able to express whether two subpaths $\pi_0$ and $\pi'_0$ of $\pi$ starting in
$x$ and ending in $y$ correspond to the binary encoding of the same number. Since the length
of such subpaths depends on $n$, we cannot check node by node whether the two subpaths
are equal. However, it is sufficient to check that if $t^{\pi_0}$ and $t^{\pi'_0}$ correspond to the same node
($t^{\pi_0} \sim t^{\pi'_0}$), then their successors also correspond to the same node ($t^{\pi_0} + 1 \sim t^{\pi'_0} + 1$). Note
that using the fact that $\pi_0$ and $\pi'_0$ are subpaths of $\pi$, we will be able to define $\phi_1$ such
that it only contains quantifications over node variables (and no quantifications over path
variables). Similarly, in the formula $\phi_C$, we use the operator $\sim$ in order to express that two
subpaths correspond to the binary encodings of numbers that are successors of each other.

Similarly to the way we define the graph $H$, we can introduce a graph $I_w$ and a formula
$\phi_0(\pi)$ such that

$I_w \models \phi_0(\pi)$ iff the label of $\pi$ is the encoding $e_{C_0}$,

where $C_0$ is the initial configuration of the run of $M$ over $w$. By adding an edge from $I_w$ to
$H$, we construct a graph $G_w$ such that for all paths $\pi$, $G_w \models \phi_0(\pi) \land \phi_1(\pi)$ iff the label of
$\pi$ is the encoding of an accepting run over $w$. Hence, the formula $\phi := \exists \pi (\phi_0(\pi) \land \phi_1(\pi))$
satisfies (4.1).

For the case where $k > 1$, the difficulty in adapting the above proof is that we have to
consider Turing machine configurations whose size is bounded by a tower of exponentials of
height $k$. If $k > 1$, the binary representation of such a bound is not polynomial. The trick is
to represent such exponential towers by $k$-counters. A 1-counter is the binary representation
of a number. If $k > 1$, a $k$-counter is a word $\sigma_0 l_0 \cdots \sigma_j l_j$ where $\sigma_i$ is a $(k-1)$-counter and
$l_i \in \{0, 1\}$.

Definition. For all natural numbers $k$, we consider the alphabet $\Sigma_k = \{a_k, b_k\}$, where $a_k$
and $b_k$ represent 0 and 1 respectively. We define $\Gamma_k$ as the alphabet $\Sigma_1 \cup \cdots \cup \Sigma_k$.

A 1-counter of length $n$ is a sequence of the form

$$l_0 \dots l_{f_0 n - 1},$$

where for all $0 \le i < f_0 n$, $l_i \in \Sigma_1$. This 1-counter represents the number $\sum_{i=0}^{f_0 n - 1} l_i 2^i$. Recall
that if $l_i$ is equal to $a_1$ (resp. $b_1$), then $l_i$ represents 0 (resp. 1).

If $k \ge 2$, a $k$-counter of length $n$ is a sequence of the form

$$\sigma_0 l_0 \dots \sigma_j l_j,$$

where for all $0 \le i \le j$, $l_i \in \Sigma_k$, $\sigma_i$ is a $(k-1)$-counter representing the number $i$ and
$j = \mathit{tower}(k-1, f_0 n) - 1$. This $k$-counter represents the number $\sum_{i=0}^{j} l_i 2^i$. Again recall
that if $l_i$ is equal to $a_k$ (resp. $b_k$), then $l_i$ represents 0 (resp. 1).

A $(k, f_0 n, p)$-description (over an alphabet $A$) is a sequence

$$\sigma_p d_p \dots \sigma_j d_j,$$

where for all $p \le i \le j$, $d_i \in A$, $\sigma_i$ is a $(k-1)$-counter representing the number $i$ and
$j = \mathit{tower}(k-1, f_0 n) - 1$. A $(k, f_0 n)$-description (over an alphabet $A$) is a $(k, f_0 n, 0)$-description. □
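
The recursive structure of $k$-counters can be illustrated by the following minimal Python sketch. It assumes the reading of the definition in which a 1-counter has $m = f_0 n$ letters and, for $k \ge 2$, a $k$-counter has $\mathit{tower}(k-1, m)$ letters of $\Sigma_k$, each preceded by the $(k-1)$-counter of its index. Letters $a_k$ and $b_k$ are written as the strings `"a<k>"` and `"b<k>"`; the function names and the convention $\mathit{tower}(0, m) = m$ are our own.

```python
def tower(k, m):
    """tower(0, m) = m and tower(k + 1, m) = 2**tower(k, m)."""
    value = m
    for _ in range(k):
        value = 2 ** value
    return value


def counter(k, value, m):
    """Letters of the k-counter encoding `value`; m plays the role of f_0 * n."""
    bits = tower(k - 1, m)            # number of letters taken from Sigma_k
    if not 0 <= value < 2 ** bits:
        raise ValueError("value does not fit in a k-counter of this length")
    word = []
    for i in range(bits):
        if k > 1:
            word.extend(counter(k - 1, i, m))   # sigma_i: the index i itself
        bit = (value >> i) & 1                  # l_i, least significant first
        word.append(("b" if bit else "a") + str(k))
    return word
```

For instance, `counter(2, 2, 1)` yields `["a1", "a2", "b1", "b2"]`: the 1-counters `a1` and `b1` encode the indices 0 and 1, and the letters `a2`, `b2` are the bits of the number 2, least significant bit first.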

Note that a $(k, f_0 n)$-description over the alphabet $\Sigma_k$ is a $k$-counter of length $n$. If $A$
is the alphabet $(Q \cup \{\$\}) \times \Sigma$ (where $Q$ is the set of states and $\Sigma$ is the alphabet of the
machine), a $(k, f_0 n)$-description over $A$ is of the form

$$\sigma_0 (x_0, y_0) \dots \sigma_j (x_j, y_j),$$

where $j = \mathit{tower}(k, f_0 n) - 1$. Hence, if we define $c(i)$ in (4.2) as the $k$-counter encoding
the number $i$, the encoding of a configuration (as defined above) is nothing but a $(k, f_0 n)$-description.


In particular, if we want to encode a run as the label of a path satisfying some
well-chosen formula in a well-chosen graph, we should also be able to encode $(k, f_0 n, p)$-descriptions
as labels of paths. We show how to do so in the following lemma.

Notation. Given a path $\pi$ in a graph over an alphabet $A$, we denote by $l(\pi)$ the label of
$\pi$. Given an alphabet $A' \subseteq A$, we denote by $l_{A'}(\pi)$ the trace of $l(\pi)$ over the alphabet $A'$,
that is, the subsequence of $l(\pi)$ obtained by deleting the letters that do not belong to $A'$.
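
The trace of a label over a sub-alphabet is a simple filtering operation, as in the following minimal Python sketch (the function name `trace` and the string representation of letters are assumptions made for the example).

```python
def trace(label, sub_alphabet):
    """Subsequence of `label` obtained by deleting the letters that are
    not in `sub_alphabet`; the order of the remaining letters is kept."""
    return [letter for letter in label if letter in sub_alphabet]


label = ["a1", "a2", "b1", "b2"]
```

Here `trace(label, {"a2", "b2"})` keeps only the letters `a2` and `b2`, in that order.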

Let $G' = (V', E', \kappa')$ be a subgraph of $G = (V, E, \kappa)$ and let $\pi$ be a path in $G$ of
the form

$$v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n,$$

where $(v_i, a_i, v_{i+1}) \in E$ for all $1 \le i < n$. Assume that there are $i_0$ and $i_1$ such that $i_0 \le i_1$
and

$$\{v_i : v_i \in V', 1 \le i \le n\} = \{v_{i_0}, \dots, v_{i_1}\},$$

that is, once the path leaves $G'$, it never goes back to $G'$. Then we define the trace $\pi'$ of $\pi$
on $G'$ as the subpath

$$v_{i_0} a_{i_0} v_{i_0 + 1} \cdots a_{i_1 - 1} v_{i_1},$$

that is, $\pi'$ is the longest subpath of $\pi$ with nodes in $G'$.

In order to make notation easier, we also abbreviate the formula

$$\exists s\, E_a(s, t)$$

by $a(t)$.

Given a formula $\phi(\pi, s, t)$ with path variable $\pi$ and node variables $s$ and $t$, we denote
by $\phi(\pi_{s,t})$ the formula obtained by replacing in $\phi(\pi, s, t)$ each quantification of the form

$$\exists r^\pi$$

by

$$\exists r^\pi \text{ s.t. } (s \le r \le t).$$

Intuitively, we “restrict” the path $\pi$ to the nodes occurring between $s$ and $t$.

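Lemma 4.2. For all $n$ and $k$ and for all alphabets $A$, there are formulas $\phi^{A}_{k,n,p}(\pi)$, one for
each admissible starting index $p$, and a graph $G^{A}_{k,n}$ satisfying the following. There is a unique
node with an outgoing (resp. incoming) edge with label $i^{A}_{k,n}$ (resp. $f^{A}_{k,n}$); moreover, that node
has no incoming (resp. outgoing) edge. That node is called the initial (resp. final) node.
Finally, $G^{A}_{k,n} \models \phi^{A}_{k,n,p}(\pi)$ iff the label $l(\pi)$ of $\pi$ satisfies the following conditions:

• only the first edge of $\pi$ is labeled $i^{A}_{k,n}$,

• only the last edge of $\pi$ is labeled $f^{A}_{k,n}$,

• if $k \ge 2$ and $A' = A \cup \Gamma_{k-1}$, then $l_{A'}(\pi)$ is a $(k, f_0 n, p)$-description over $A$;

• if $k = 1$, $l_{\Sigma_1}(\pi)$ is a 1-counter of length $n$.

We let $\phi^{A}_{k,n}(\pi)$ be an abbreviation for $\phi^{A}_{k,n,0}(\pi)$.

Moreover, if $A = \Sigma_k$, then there are formulas $\mathit{succ}_{k,n}(\pi, \pi')$, $\mathit{number}^{i}_{k,n}(\pi)$ ($1 \le i \le n$),
$\mathit{last}_{k,n}(\pi)$ and $\mathit{eq}_{k,n}(\pi, \pi')$ such that for all paths $\pi$ and $\pi'$ satisfying
$G^{\Sigma_k}_{k,n} \models \phi^{\Sigma_k}_{k,n}(\pi) \land \phi^{\Sigma_k}_{k,n}(\pi')$, we have

• $G^{\Sigma_k}_{k,n} \models \mathit{succ}_{k,n}(\pi, \pi')$ iff the number encoded by $l_{\Gamma_k}(\pi')$ is the successor of the number
encoded by $l_{\Gamma_k}(\pi)$.

• $G^{\Sigma_k}_{k,n} \models \mathit{number}^{i}_{k,n}(\pi)$ iff $l_{\Gamma_k}(\pi)$ is the encoding of the number $i$.

• $G^{\Sigma_k}_{k,n} \models \mathit{last}_{k,n}(\pi)$ iff $l_{\Gamma_k}(\pi)$ is the encoding of the number $\mathit{tower}(k, f_0 n)$.

• $G^{\Sigma_k}_{k,n} \models \mathit{eq}_{k,n}(\pi, \pi')$ iff the number encoded by $l_{\Gamma_k}(\pi')$ is equal to the number encoded by
$l_{\Gamma_k}(\pi)$.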

Proof. The formulas and the graph are defined by induction on $k$. Suppose first that $k = 1$
and $A = \Sigma_1$. We define $G_0$ as the following graph

[Figure: the graph $G_0$.]

where the number of nodes with outgoing edges with labels $a_1$ and $b_1$ is equal to $f_0 n$. The
label $N$ is an additional label that we introduce in order to simplify the notation in the
formulas.

We let $G^{\Sigma_1}_{1,n}$ be the graph $G_0$. We define now the formula $\phi^{\Sigma_1}_{1,n}(\pi)$. In fact, any path $\pi$
over $G_0$ starting with the node with no incoming edge and ending with the node with no
outgoing edge will be such that $l_{\Sigma_1}(\pi)$ is the encoding of a 1-counter. Hence, we can define
$\phi^{\Sigma_1}_{1,n}(\pi)$ as the conjunction of the formula

$$\exists s^\pi [\neg \exists t^\pi,\ t < s]$$

and the formula 

< t]. 

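We show now how to define the formulas $\mathit{num}^{i}_{1,n}(\pi)$ (by induction on $i$), $\mathit{succ}_{1,n}(\pi, \pi')$,
$\mathit{eq}_{1,n}(\pi, \pi')$ and $\mathit{last}_{1,n}(\pi)$. For the formula $\mathit{last}_{1,n}(\pi)$, a path $\pi$ corresponds to the
encoding of the number $2^{f_0 n} - 1$ iff we always choose the node with label $b_1$ or, equivalently,
iff we never choose the node with label $a_1$. Hence, we may define $\mathit{last}_{1,n}(\pi)$ as the formula

$$\neg \exists t^\pi\, a_1(t).$$
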
For the formula $\mathit{eq}_{1,n}(\pi, \pi')$, two paths $\pi$ and $\pi'$ correspond to the same number iff $\pi$ and $\pi'$
are equal. Since $\pi$ and $\pi'$ are simple paths with the same starting node, this is equivalent
over graph databases (where each node carries a different data value) to the fact that the
following formula holds

$$\forall t^\pi \forall s^{\pi'} \big[ (t \sim s) \to (t + 1 \sim s + 1) \big].$$

The formula $\mathit{num}^{i}_{1,n}(\pi)$ is defined by induction on $i$. If $i = 0$, the path $\pi$ encodes the
number 0 iff we always choose the node with label $a_1$ or, equivalently, iff we never choose
the node with label $b_1$, which is expressed by

$$\neg \exists t^\pi\, b_1(t).$$

For the induction case, the path $\pi$ encodes the number $i + 1$ iff there is a path $\pi''$ encoding
the number $i$ and the number encoded by $\pi$ is the successor of the number encoded by $\pi''$.
Hence, we can define $\mathit{num}^{i+1}_{1,n}(\pi)$ as the formula

$$\exists \pi'' \big( \mathit{num}^{i}_{1,n}(\pi'') \land \mathit{succ}_{1,n}(\pi'', \pi) \big).$$

In order to finish the base case, it remains to define the formula $\mathit{succ}_{1,n}(\pi, \pi')$. Basically,
we have to simulate addition in binary. If $x_1 \dots x_{f_0 n}$ is the binary encoding of a number
$i < 2^{f_0 n} - 1$, then the binary encoding of the number $i + 1$ is the sequence $x'_1 \dots x'_{f_0 n}$ such
that $x'_m$ is equal to

(a) 1 if $x_m = 0$ and all the elements $x_{m+1}, \dots, x_{f_0 n}$ are equal to 1,

(b) 0 if $x_m = 1$ and all the elements $x_{m+1}, \dots, x_{f_0 n}$ are equal to 1,

(c) 0 if $x_m = 0$ and there is an element in the sequence $x_{m+1} \dots x_{f_0 n}$ that is equal to 0,

(d) 1 if $x_m = 1$ and there is an element in the sequence $x_{m+1} \dots x_{f_0 n}$ that is equal to 0.

Case (a) can be expressed by the following formula

$$\big[\, \cdots \sim \cdots \land a_1(t) \land \forall s \in \pi\, [(t < s) \land N(s - 1) \to b_1(s)] \,\big] \to b_1(t').$$

The other cases can be treated similarly. This finishes the base case. 
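
The four cases (a)-(d) amount to the following rule: a bit is flipped exactly when all the bits after it are equal to 1, and it is copied otherwise. The minimal Python sketch below applies this rule to a list of bits written most significant bit first, as in the case analysis above; the function name `successor_bits` is our own, and the input is assumed not to consist only of 1s.

```python
def successor_bits(x):
    """Binary successor computed position by position, following (a)-(d).

    x is a list of bits (0 or 1), most significant bit first, encoding a
    number strictly below 2**len(x) - 1.
    """
    result = []
    for m, bit in enumerate(x):
        later_all_ones = all(b == 1 for b in x[m + 1:])
        if later_all_ones:
            result.append(1 - bit)   # cases (a) and (b): the bit is flipped
        else:
            result.append(bit)       # cases (c) and (d): the bit is copied
    return result
```

For example, on the encoding `[0, 1, 1]` of 3 every position sees only 1s to its right, so all three bits are flipped and the result `[1, 0, 0]` encodes 4. Each output bit depends only on the input bit at the same position and on a universal condition over the later positions, which is what makes the relation expressible position by position in a formula.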

We turn now to the induction step. If $A = \{d_1, \dots, d_l\}$, we define $G^{A}_{k+1,n}$ as the
following graph

[Figure: the graph $G^{A}_{k+1,n}$.]



The edge with label ^ and the edge with label are pointing to the initial node in 

The edge with label is an edge starting from the final node in G^^. 

We define now the formula $\phi^{A}_{k+1,n,p}(\pi)$. The intuition is as follows. We encode a
$(k + 1, f_0 n, p)$-description

$$\sigma_p d_p \dots \sigma_j d_j$$

as a path $\pi$ starting with the edge with label $i^{A}_{k+1,n}$ and ending with the edge with label
$f^{A}_{k+1,n}$. Each $k$-counter $\sigma_i$ will correspond to a path through the subgraph $G^{\Sigma_k}_{k,n}$, while $d_i$ will
correspond to the label of an edge occurring after that passage. The formula
$\phi^{A}_{k+1,n,p}$ needs to ensure that the following hold:

(a) The first edge of $\pi$ is the edge with label $i^{A}_{k+1,n}$.

(b) Each “passage” of the path $\pi$ through the graph $G^{\Sigma_k}_{k,n}$ corresponds to the encoding of
a $k$-counter. To express this, we will use the formula $\phi^{\Sigma_k}_{k,n}(\pi)$ given by the induction
hypothesis.

(c) The first time the path $\pi$ “goes through” the graph $G^{\Sigma_k}_{k,n}$ corresponds to the encoding
of the number $p$.

(d) Two successive “passages” of $\pi$ through the graph $G^{\Sigma_k}_{k,n}$ correspond to two successive
$k$-counters.

(e) The edge with label $f^{A}_{k+1,n}$ occurs iff the last passage of the path $\pi$ through the graph
$G^{\Sigma_k}_{k,n}$ corresponded to the encoding of the number $\mathit{tower}(k, f_0 n)$. This ensures that we
fully encode a $(k + 1, f_0 n)$-description, and not a subsequence of it.

We only show how to express (b) as this is one of the most difficult cases and the other ones
can be treated similarly.

For (b) we have to express that each passage of tt through the graph G^^ corresponds 
to the encoding of a /c-counter. Recall that by the induction hypothesis, since a (/c, fon)- 
description over is a /c-counter of length k, the formula 4’kni'^') ^ true in the graph G^^ 
iff /rfc(viO is the encoding of a /c-counter of length n. 

Hence, in order to express (b), it is enough to ensure that if s is the first node of a 

y^ 

passage of tt through G^ ^ and if t is the last node of that same passage, then the formula 
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holds. We introduce a formula IFk,n{s,t,7r) such that 

IFk,n{s,t,TT) holds iff s is the first node of a passage of vr through (4-3) 

and t is the last node of that same passage. 

We define vr) as the formula 

{t - l,t) /\ {s < t) A -<3u'" [{s < u < t) a\J di{u)]. 

J k,n 

I 

This equivalent to saying that s is the initial node of that t is the final node of G^^ 
and the path “never goes out” of the graph G^^ (this can be enforced by imposing that we 
do not go through the edge with label di for some i). 

We define now the formula tt) expressing condition (b), that is, if s is the first 

node of a passage of vr through G^^ and if t is the last node of that same passage, then the 
formula 4’'kni'^s,t) holds. By (|4.3n . we may define xi(s,t,7r) as the formula 

IFk,nis,t,TT) 


We turn now to the definitions of the formulas succk+i^niT^), 

and lastk+i^niT^)- The formulas succk+i,n{'^)-, ^{tt) and lastk+i^n{'^) are defined in 

a similar fashion as the basis case {k = 1). 

In order to define the formula ^(vr, vr'), let vr and vr' be two paths satisfying the 

y 

formula Recall that tt corresponds to the encoding of a (A: + l)-counter 


^\d\ ... 0'jdj^ 

y 

where each ai corresponds to a passage vr^^t of vr through G^^ and di corresponds to the 
label of an edge occurring right after that passage. Given the structure of the graph Gf^i ni 
that edge is the incoming edge of the node t + 2. 

The paths vr and vr' correspond to the encoding of the same {k + l)-counter if for all 
passages TTg^t of vr through G^^ and for all passages vrs/^^/ of vr through G^^ such that vTs,* 
and TTs',t' encode the same /c-counter, we have that t + 2 and t' + 2 are the same nodes. 
By (j4.3p and by the induction hypothesis, this can be expressed by the following formula 
e9A:+i,n(7r,vr') given by 

Vs'", r V(s')'"', {t'Y' [IF{s, t, vr) A IF{s', t' , vr') A eqk^Y'^s,t, T^s',t') -A {t + 2 t' + 2)] 

This finishes the proof of Lemma 4.2. □


We are now ready to prove Theorem 4.1.

Proof of Theorem 4.1. As explained earlier, we prove that for all Turing machines M and
for all k, there is a formula $\varphi \in$ WL such that for all words w of size n, there is a graph $G_w$
such that


$G_w \models \varphi$ iff there is an accepting run of M over w using at most tower(k, n) cells.

Let $(\Sigma, Q, \delta, q_0, q_f)$ be the Turing machine M, where $\Sigma$ is the input alphabet together with a
blank symbol B, $q_0$ is the initial state, $q_f$ is the final state and
$\delta : Q \times \Sigma \to (Q \times \Sigma \times \{L, R\})$
is the transition map, where L stands for “left” and R stands for “right”.
The formula $\varphi$ is a formula of the form

$\exists \pi\, \psi$,

where $\psi$ is a formula that does not contain any quantification over path variables. Given
a word w, the label of $\pi$ in the graph $G_w$ is the encoding of an accepting run of M over
the word w. Recall that we encode a configuration of the machine in the following way.
Suppose that C is a configuration where the content of the tape is the word $w' = w'_0 \ldots w'_j$
($j = tower(k, n) - 1$), the head is scanning the cell number $i_0$ and the machine is in state
q. We may encode C by the word $e_C = d_0 \ldots d_j$ where each $d_i$ is a sequence

$c(i)\,(q'_i, w'_i)$,

and $c(i)$, $w'_i$ and $q'_i$ are defined as follows. The word $c(i)$ is the k-counter encoding the
number i. The letter $w'_i$ is the content of the cell i. The letter $q'_i$ is equal to $\$$ if the head is
not scanning the cell number i; otherwise, $q'_i$ is equal to the state q. This implies that given
a configuration C, the word $e_C$ is a (k + 1, n)-counter over the alphabet $A := (Q \cup \{\$\}) \times \Sigma$.
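
As an illustration of this encoding, the following Python sketch lists, for each cell, its number together with the letter from $(Q \cup \{\$\}) \times \Sigma$ attached to it. It is a simplification under stated assumptions: the cell number i stands in for the k-counter word $c(i)$, and the function name `encode_configuration` and its parameters are hypothetical.

```python
def encode_configuration(tape, head, state, idle="$"):
    """Return the list [(i, (q_i, w_i)), ...] for one configuration.

    tape  : sequence of tape symbols, one per cell
    head  : number of the cell currently scanned
    state : current state of the machine
    idle  : marker used for cells that the head is not scanning

    Assumption: the integer i is used in place of the k-counter c(i).
    """
    return [
        (i, (state if i == head else idle, symbol))
        for i, symbol in enumerate(tape)
    ]
```

For example, a three-cell tape `"abB"` with the head on cell 1 in state `"q"` yields one entry per cell, and only the entry for cell 1 carries the state.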

The run of M over the word re is a sequence of configurations of the form CqCi .... We 
encode the run as the word ecoGCi ■ ■ ■ (which is a sequence of {k + 1, n)-counters). We will 
define the formula the graph Gw in such a way that a path vr satisfies iff the 

projection of the label of vr on the alphabet T^ U A is the encoding of an accepting run of 
M over w. 

We think of a path $\pi$ encoding a run of M over w as consisting of two parts. The label
of the first part contains the encoding $e_{C_0}$ of the initial configuration $C_0$. The label of the
second part contains the encoding $e_{C_1} e_{C_2} \cdots$ of the remaining part of the run. The first
part of the path $\pi$ is a path in a subgraph $I_w$ of $G_w$, while the second part is a path in the
subgraph H (independent of w) of $G_w$. The graph $G_w$ will be obtained by adding an edge
from a node of $I_w$ to a node of H.

We start by defining the graph H. Recall that A is the alphabet {Q U {$}) x S. The 
graph H is defined as the graph G^^ with an additional edge from the hnal node to the 
initial node. Hence, it follows from the proof of Lemma 14.21 that H is the following graph, 


di 



where A = {di,... ,di} and where the edges with label and are edges pointing to 

the initial node of and the edge with label A*^+^ is an edge starting from the hnal node 
of In the above paragraphs, any edge pointing to the graph G^^^ ^ is an edge pointing 
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to the initial node of that graph. Similarly, any edge starting from the graph ^ will 
always be referring to an edge starting in the final node of the graph. 

Recall that if a path vr encodes a run C'oC'iC '2 ..., the trace of vr on // will encode the 
part C 1 C 2 ... of the run. Each configuration Ci is encoded as a (A: + 1,/on)-description 
over A, which will correspond, as in Lemma 14.21 to a passage of the path vr from the initial 
node to the hnal node of G^^^ 

We now define the graph $I_w$ encoding the initial configuration of the tape. Recall that
in the initial configuration, the tape contains the word $w = w_0 \ldots w_{n-1}$, all the cells with
number > n contain the blank symbol B, the head is scanning the first cell and the state is
$q_0$. The graph $I_w$ is obtained by “assembling” the subgraphs $K_0, \ldots, K_{n-1}$ and K, which we
will define next. For each i < n, the graph $K_i$ is such that the label of its unique maximal
path is the encoding of the cell number i in the initial configuration. The trace of the path
$\pi$ on the graph K is the encoding of the contents of the cells with number > n in the initial
configuration.

More precisely, we define the graphs Kq, ..., iL^-i and K in the following way. The 
following graph is the graph Kq 

H,wo) 

- 

The node with label in will be the starting node of the path tt. Since the trace of vr in Ki 
encodes the content of the cell with number 0 in the initial configuration, its label must 
contain the /s-counter encoding the number 0 followed by the letter {qi,WQ) of the alphabet 
A (indicating that the hrst cell contains the letter wo, the head is scanning the first cell and 
the current state is qo). Using Lemma 14.21 and the formula num^ we will impose that the 

passage of vr through the subgraph of Kq corresponds to the encoding of the number 

0 . 

Next, for all 1 < z < n, we dehne Ki as the following graph 



Recall that we want to define Ki in such a way that the trace of vr on Ki is the encoding 
of the contents of the cell with number in in the initial conhguration (that it, it contains 
the letter Wi and the head is not scanning the cell since i ^ 0). Recall that the encoding of 
such a cell (and its content) is given by 

c(z)($,r(;i), 

where c(i) is the /c-counter encoding the number i. We will use the formula num\. ^ given by 
Lemma 14.21 to express that the passage of vr through the subgraph G^^ of Ki corresponds 
to the A:-counter encoding i. 

Finally we define the graph K as the graph 
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where is the one-letter alphabet containing the blank symbol B. Recall that the trace 
of vr on the graph K will encode the contents of the cells with number > n in the initial 
configuration (that is, the fact that those cells contain the blank symbol and are not scanned 
by the head). Since the encoding of such a cell with number i is given by 

c(i)($,R) 

(where c{i) is the fc-counter encoding i), the label of the trace of vr on R' must contain the 
word 

c(n-h l)(R,$)...c(j)(R,$), 

where j = tower{k, n) — 1. That is, the label of the trace of vr on R' is the unique {k, fon,n + 
l)-description over the alphabet A^. We will express that the passage of vr through the 
graph corresponds to the {k,fon,n + l)-description over the alphabet A^ using the 
formula 4>tnn+i Provided by Lemma 021 

We are now ready to define the graph which is obtained by assembling the graphs 
previously introduced in the following way. 



Each edge between two graphs in the picture above is an edge from the “left-most” 
node of the first graph to the “right-most” node of the second graph. Finally the graph 
is the graph obtained by considering the union of the graph and H and adding an edge 
from the final node of K to the initial node of H. 

Now that we have defined the graph G^, we are ready to define the formula v/i. The 
formula is obtained as the conjunction of the following formulas. 

(A) First we need to express that the path vr starts with the edge with label in. 

(B) We need to express that eventually in a configuration, the machine reaches the final 
state Qf. 

(C) We also have to express that each passage of the path vr from the initial node of G^^ 

to the final node of G^^ in the graph H corresponds to the encoding of a (/c -|- 1, fon)- 
description. 

(D) We have to express that for all i <n, the trace of vr on the subgraph G^^ of the graph 
Ki corresponds to the /c-counter encoding i. 

(E) We need to express that the trace of the path vr on the subgraph G^^^ ^ of LC is the 
unique (fc -|- 1, /on, n + l)-description over the alphabet A^. 

(F) Finally we need to express how we move from one configuration of the tape to the next 
one. 


Cases (A) and (B) are straightforward. Cases (C), (D) and (E) are similar and we only give 
details for case (C) and case (F). By Lemma 021 case (C) means that 

if TTg^t is the subpath of vr corresponding to such a passage, then holds. 

(4.4) 

The node s is a node satisfying while t is the “closest” node to s with an incoming 
edge with label This is expressed by the following formula ^(s,t,7r) defined 

by 




(s) AR Efc^i(t-l,f) 

J fc+l,n 


[s < r <t A E E 
f) 


fe+i(r- l,r)]. 

fc+l,n 


(4.5) 
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It follows from the definitions of IF^{s,t,7r) and the graph that 

G*f+i,n ^ iff i® i'ff® node of a passage of vr through 

and t is the last node of that same passage. (4.6) 

Hence, (14.41) is equivalent to the fact that if ^(s, t, vr) holds, so does the formula 

oi'^s,t)- Therefore the formula 

[IF^{s,t,TT) (j)k+l,n,oi^s,t)] 

expresses case (C). 

Finally we treat the most difficult case, which is case (F). We need to express how we
move from one configuration of the tape to the next one. Recall that the trace of $\pi$ on the
graph H will contain the encoding of the sequence $C_1 C_2 \cdots$ of the run, where $C_0 C_1 \ldots$ is
the full run of the machine on the input w.

Let $\pi_{s,t}$ be the subpath of $\pi$ corresponding to the encoding of a configuration $C_i$ and
let $\pi_{s',t'}$ be the subpath of $\pi$ corresponding to the configuration $C_{i+1}$. We need to express
how to move from the configuration $C_i$ to the configuration $C_{i+1}$. Suppose that in the
configuration $C_i$, the current state is q, and the head is scanning the cell c containing the
letter u. Suppose also that $\delta(q, u) = (q', v, R)$ (we can treat similarly the case where the
head moves to the left). In order to keep our formulas simpler, we use a slightly different
definition of a run of a Turing machine, but it should be clear that the notion of run that we
use here can be simulated by a usual Turing machine. Here, we assume that if the machine
scans a cell c with content u and $\delta(q, u) = (q', v, R)$, then in the next state, the machine
scans the successor c' of c, the content of c' is v, while the content of c is u (in the usual
definition, the content of c' is unchanged, while the content of c is v).

Let $\pi_{r,s}$ be the subpath of $\pi$ corresponding to the encoding of the cell c in the
configuration $C_i$. Let $\pi_{r',s'}$ be the encoding of an arbitrary cell c' in the configuration $C_{i+1}$. If c' is
the successor of the cell c, then the head should scan the cell c' and the content of c' should
be the letter v. We express this by the formula $change_{(q,u,q',v)}(r, s, r', s', \pi)$ defined by

$succ_{k,n}(\pi_{r,s}, \pi_{r',s'}) \wedge (q, u)(s + 2) \to (q', v)(s' + 2)$.

Recall that by Lemma 4.2, $succ_{k,n}(\pi_{r,s}, \pi_{r',s'})$ is the formula expressing that the k-counter
associated with $\pi_{r',s'}$ is the successor of the k-counter associated with $\pi_{r,s}$.

If F is not the successor of the cell c, then the head is not scanning the cell F and 
its content remains unchanged. If is the subpath of vr corresponding to the con¬ 
tent of the cell F in the configuration Cj, this is expressed by the following formula 
^^'^yfq,u,q',v) (a S) X, V/, /, s’, tt) defined by 

-^succk,n('^r,s, A (q, u)(s + 2) A eq;, „(7r^^y, TTr'^s') A (g", uo){x, y) -A ($, uq){s + 2) 

where g" € Q U {$}. Recall that by Lemma [4.21 eqi^ Tires') expresses that the k- 

counters associated with -Kx^y and Tir',s' the same. 

Now we need to express that the paths vrr,s, and 'Kx,y correspond to the encodings 
of fe-counters. By Lemma 14.21 this means that those paths correspond to passages of vr 
through the graph Similarly to (j4.5p . we introduce a formula /F^^(r, s, vr) defined by 

(« - 1, s) A [r <t<s A F; Efc {t - 1, t)]. 

This formula expresses that the path vr^^s corresponds to the encoding of a F-counter. 
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Next we also need a formula to assert that the paths 'Kr,s and Tir',s' appear in the 
encodings of successive configurations (and similarly, that the paths 'Kx,y and vr,./ 5 / appear 
in the encodings of successive configurations). Since the encoding of a configuration starts 
with the unique edge with label „ (and that edge only occurs at the beginning of the 
encoding of a configuration), this is equivalent to say that there is a unique edge between 
s and r' with label This is expressed by the formula config{s, r', tt) defined by 


[{s<t< /) A ik+i,n{t) A -3(t')’' [{s <t' < /) A ik+i,n{t') A {t' + t)]]. 
We are now ready to define ^ (tt) as the following formula 

Vr’", s’", (/)"", (s')’', x"", y"" s, tt) A IF^^^{r', s', tt) A IF^%{x, y, vr) 

Aconfig{s, r', vr) A config{y, r'vr) 
^ c/ianfireg ,^ ,j, „)(s,r,s',r',7r) A stoyg g, „)(r, s, x, y, r', s', vr)]. 


(4.7) 

(4.8) 

(4.9) 


It expresses the following. Suppose that TTr,s, '^r,s and TTr\s' are fe-counters encoding the 
numbers of three cells (this corresponds to (I4.7P '). Suppose that and -K^^y correspond to 
cells occurring in the same configuration C and that the cell corresponding to TTr'^s' occurs 
in the next configuration C (this is expressed by (14.811 b Then, if we “apply” the transition 
(5(g, u) = (g', u, R) to move from C to C", we move the head to the right and update the con¬ 
tent of the cell being scanned (as expressed by the formula change^^ ^ ^^(r, s, r', s', vr)) and 

we leave the other cells unchanged (as expressed by the formula stay^^ ^ (r, s, x, y, r', s', vr)) 

We define now the formula ^^(vr) as the formula 




This formula expresses how we move from one configuration to another, when the head 
moves to the right. Similarly, we can define a formula 9l{tt) expressing how we move from 
one configuration to another, when the head moves to the left. 

This finishes the proof of Theorem 4.1. □


As a corollary to the proof of Theorem 4.1 we obtain that data complexity is
non-elementary even for simple WL formulas that talk about a single path in a graph database.

Corollary 4.3. The evaluation problem for WL over graph databases is non-elementary in
data complexity, even if restricted to Boolean WL formulas of the form $\exists \pi\, \psi$, where $\psi$ uses
no path quantification and contains no position variable of sort different than $\pi$.


5. Register Logic 

We saw in the previous section that WL is impractical due to its very high data
complexity. In this section, we start by recalling the notion of regular expressions with memory
(REM) and their basic results from [10]. In our view, this logic is rather limited in terms of
expressive power. For instance, the query (Q) from the introduction cannot be expressed
in REM. We then introduce an extension of REM, called register logic (RL), that
remedies this limitation in expressive power (in fact, it can express many natural examples of
queries expressible in WL, e.g., those given in [8]) while retaining elementary complexity of
query evaluation. Finally, we study which fragments of RL are well-behaved for database
applications.

5.1. Regular expressions with memory. REMs define pairs of nodes in data graphs that
are linked by a path that satisfies a constraint on the way in which the topology interacts
with the underlying data. REMs allow us to remember data values and use them later.
Data values are stored in k registers $r_1, \ldots, r_k$. At any point we can compare a data value
with one previously stored in the registers. As an example, consider the REM $\downarrow r.a^+[r^=]$.
This can be read as follows: store the current data value in register r (represented by the
expression $\downarrow r$), and then check that after reading a word in $a^+$ we see the same data value
again (condition $[r^=]$). We formally define REM next.

Let $r_1, \ldots, r_k$ be registers. The set of conditions c over $\{r_1, \ldots, r_k\}$ is recursively defined
as: $c := r_i^= \mid c \wedge c \mid \neg c$, for $1 \le i \le k$. Assume that $\mathcal{D}_\perp$ is the extension of the set $\mathcal{D}$ of data
values with a new symbol $\perp$. Satisfaction of conditions is defined with respect to a value
$d \in \mathcal{D}$ (the data value that is currently being scanned) and a tuple $\tau = (d_1, \ldots, d_k) \in \mathcal{D}_\perp^k$
(the data values stored in the registers, assuming that $d_i = \perp$ represents the fact that
register $r_i$ has no value assigned) as follows (Boolean combinations omitted): $(d, \tau) \models r_i^=$
iff $d = d_i$.

Definition 5.1 (REMs). The class of REMs over $\Sigma$ and $\{r_1, \ldots, r_k\}$ is defined by the
grammar:

$e := \varepsilon \mid a \mid e \cup e \mid e \cdot e \mid e^+ \mid e[c] \mid \downarrow\bar{r}.e$

where a ranges over symbols in $\Sigma$, c over conditions over $\{r_1, \ldots, r_k\}$, and $\bar{r}$ over tuples of
elements in $\{r_1, \ldots, r_k\}$. □

That is, REM extends the class of regular expressions e - which is a popular mechanism 
for specifying topological properties of paths in graph databases (see, e.g., mm) - with 
expressions of the form $e[c]$, for c a condition, and $\downarrow\bar{r}.e$, for $\bar{r}$ a tuple of registers - that
define how such topology interacts with the data.

Semantics: To define the evaluation $e(G)$ of an REM e over a data graph $G = (V, E, \kappa)$,
we use a relation $[\![e]\!]_G$ that consists of tuples of the form $(u, \lambda, \rho, v, \lambda')$, for u, v nodes in V,
$\rho$ a path in G from u to v, and $\lambda, \lambda'$ two k-tuples over $\mathcal{D}_\perp$. The intuition is the following:
the tuple $(u, \lambda, \rho, v, \lambda')$ belongs to $[\![e]\!]_G$ if and only if the data and topology of $\rho$ can be
parsed according to e, with $\lambda$ being the initial assignment of the registers, in such a way
that the final assignment is $\lambda'$. We then define $e(G)$ as the pairs $(u, v)$ of nodes in G such
that $(u, \perp^k, \rho, v, \lambda) \in [\![e]\!]_G$, for some path $\rho$ in G from u to v and k-tuple $\lambda$ over $\mathcal{D}_\perp$.

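We inductively define the relation $[\![e]\!]_G$ below. We assume that $\lambda_{\bar{r}=d}$, for $d \in \mathcal{D}$, is the tuple
obtained from $\lambda$ by setting all registers in $\bar{r}$ to be d. Also, if $\rho_1 = v_1 a_1 v_2 \cdots v_{k-1} a_{k-1} v_k$ and
$\rho_2 = v_k a_k v_{k+1} \cdots v_{n-1} a_{n-1} v_n$ are paths, then:

$\rho_1 \rho_2 := v_1 a_1 v_2 \cdots v_{k-1} a_{k-1} v_k a_k v_{k+1} \cdots v_{n-1} a_{n-1} v_n$.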

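Then we define:

• $[\![\varepsilon]\!]_G = \{(u, \lambda, \rho, u, \lambda) : u \in V, \rho = u, \lambda \in \mathcal{D}_\perp^k\}$.

• $[\![a]\!]_G = \{(u, \lambda, \rho, v, \lambda) : \rho = uav, \lambda \in \mathcal{D}_\perp^k\}$.

• $[\![e_1 \cup e_2]\!]_G = [\![e_1]\!]_G \cup [\![e_2]\!]_G$.

• $[\![e_1 \cdot e_2]\!]_G = [\![e_1]\!]_G \circ [\![e_2]\!]_G$, where $[\![e_1]\!]_G \circ [\![e_2]\!]_G$ is the set of tuples $(u, \lambda, \rho, v, \lambda')$ such that
$(u, \lambda, \rho_1, w, \lambda'') \in [\![e_1]\!]_G$ and $(w, \lambda'', \rho_2, v, \lambda') \in [\![e_2]\!]_G$, for some $w \in V$, k-tuple $\lambda''$ over $\mathcal{D}_\perp$,
and paths $\rho_1, \rho_2$ such that $\rho = \rho_1 \rho_2$.

• $[\![e^+]\!]_G = [\![e]\!]_G \cup ([\![e]\!]_G \circ [\![e]\!]_G) \cup ([\![e]\!]_G \circ [\![e]\!]_G \circ [\![e]\!]_G) \cup \cdots$

• $[\![e[c]]\!]_G = \{(u, \lambda, \rho, v, \lambda') \in [\![e]\!]_G : (\kappa(v), \lambda') \models c\}$.

• $[\![\downarrow\bar{r}.e]\!]_G = \{(u, \lambda, \rho, v, \lambda') : (u, \lambda_{\bar{r}=\kappa(u)}, \rho, v, \lambda') \in [\![e]\!]_G\}$.

For each REM e, we will use the shorthand notation $e^*$ to denote $\varepsilon \cup e^+$.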

Example 5.1. The REM $\Sigma^* \cdot (\downarrow r.\Sigma^+[r^=]) \cdot \Sigma^*$ defines the pairs of nodes in data graphs that
are linked by a path in which two nodes have the same data value. The REM $\downarrow r.(a[\neg r^=])^+$
defines the pairs of nodes that are linked by a path $\rho$ with label in $a^+$, such that the data
value of the first node in the path is different from the data value of all other nodes in $\rho$. □
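
The inductive semantics can be turned into a small evaluator. The following Python sketch is an illustration only, not the evaluation algorithm of [10]: it drops the path component of each tuple and keeps $(u, \lambda, v, \lambda')$, which suffices to compute $e(G)$, and it lets registers range over the data values of the graph plus the empty symbol. The tuple-based encoding of REMs (`"eps"`, `"sym"`, `"union"`, `"concat"`, `"plus"`, `"cond"`, `"store"`) and all function names are hypothetical.

```python
from itertools import product

BOT = None  # stands for "register has no value"; data values must not be None


def sat(c, d, regs):
    """(d, regs) |= c for conditions ("eq", i), ("and", c1, c2), ("not", c)."""
    tag = c[0]
    if tag == "eq":
        return regs[c[1]] == d
    if tag == "and":
        return sat(c[1], d, regs) and sat(c[2], d, regs)
    if tag == "not":
        return not sat(c[1], d, regs)
    raise ValueError(f"unknown condition {tag!r}")


def compose(r1, r2):
    """Relational composition on (start, assignment, end, assignment) tuples."""
    index = {}
    for (w, mid, v, lam2) in r2:
        index.setdefault((w, mid), set()).add((v, lam2))
    return {(u, lam, v, lam2)
            for (u, lam, w, mid) in r1
            for (v, lam2) in index.get((w, mid), ())}


def evaluate(e, nodes, edges, kappa, k):
    """Path-free projection of the relation defined by the REM e."""
    assignments = list(product(set(kappa.values()) | {BOT}, repeat=k))
    tag = e[0]
    if tag == "eps":
        return {(u, lam, u, lam) for u in nodes for lam in assignments}
    if tag == "sym":
        return {(u, lam, v, lam)
                for (u, a, v) in edges if a == e[1]
                for lam in assignments}
    if tag == "union":
        return (evaluate(e[1], nodes, edges, kappa, k)
                | evaluate(e[2], nodes, edges, kappa, k))
    if tag == "concat":
        return compose(evaluate(e[1], nodes, edges, kappa, k),
                       evaluate(e[2], nodes, edges, kappa, k))
    if tag == "plus":
        base = evaluate(e[1], nodes, edges, kappa, k)
        result = set(base)
        while True:  # least fixpoint; terminates because the relation is finite
            bigger = result | compose(result, base)
            if bigger == result:
                return result
            result = bigger
    if tag == "cond":
        inner = evaluate(e[1], nodes, edges, kappa, k)
        return {t for t in inner if sat(e[2], kappa[t[2]], t[3])}
    if tag == "store":
        inner = evaluate(e[2], nodes, edges, kappa, k)
        out = set()
        for u in nodes:
            for lam in assignments:
                stored = tuple(kappa[u] if i in e[1] else lam[i] for i in range(k))
                out |= {(u, lam, v, lam2)
                        for (u2, l, v, lam2) in inner
                        if u2 == u and l == stored}
        return out
    raise ValueError(f"unknown REM constructor {tag!r}")


def rem_pairs(e, nodes, edges, kappa, k):
    """e(G): pairs reachable when every register starts out empty."""
    empty = (BOT,) * k
    return {(u, v) for (u, lam, v, _) in evaluate(e, nodes, edges, kappa, k)
            if lam == empty}
```

On the path 1 -a-> 2 -a-> 3 with data values x, y, x, the REM $\downarrow r.a^+[r^=]$ selects only the pair (1, 3), while $\downarrow r.(a[\neg r^=])^+$ selects (1, 2) and (2, 3). The sketch enumerates all register assignments, so it is exponential in k and meant for small examples only.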

The problem Eval(REM) is, given a data graph $G = (V, E, \kappa)$, a pair $(v_1, v_2)$ of nodes
in V, and an REM e, is $(v_1, v_2) \in e(G)$? The data complexity of the problem refers again
to the case when e is considered to be fixed. REMs are tractable in data complexity and
have no worse combined complexity than FO over relational databases:

Proposition 5.1 ([10]). Eval(REM) is Pspace-complete, and NLogspace-complete in
data complexity.


5.2. Register logic. REM is well-behaved in terms of the complexity of evaluation, but
its expressive power is rather rudimentary for expressing several data/topology properties
of interest in data graphs. As an example, the query (Q) from the introduction - which
can be easily expressed in WL - cannot be expressed as an REM (we actually prove a
stronger result later). The main shortcomings of REM in terms of its expressive power
are its inability to (i) compare data values in different paths and (ii) express branching
properties of the data.

In this section, we propose register logic (RL) as a natural extension of REM that makes
up for this lack of expressiveness. We borrow ideas from the logic CRPQ$^\neg$, presented in [3],
that closes the class of regular path queries [7] under Boolean combinations and existential
node and path quantification. In the case of RL we start with REMs and close them
not only under Boolean combinations and node and path quantification - which allow us to
express arbitrary patterns over the data - but also under register assignment quantification
- which permits comparing data values in different paths. We also prove that the combined
complexity of the evaluation problem for RL is elementary (Expspace), and, thus, that in
this regard RL is in stark contrast to WL.

To define RL we assume the existence of countably infinite sets of node, path and
register assignment variables. Node variables are denoted $x, y, z, \ldots$, path variables are
denoted $\pi, \pi', \pi_1, \pi_2, \ldots$, and register assignment variables are denoted $\nu, \nu_1, \nu_2, \ldots$

Definition 5.2 (Register logic (RL)). We define the class of RL formulas $\varphi$ over alphabet
$\Sigma$ and $\{r_1, \ldots, r_k\}$ using the following grammar:

atom := $x = y \mid \pi = \pi' \mid \nu = \nu' \mid \nu = \perp \mid (x, \pi, y) \mid e(\pi, \nu_1, \nu_2)$

$\varphi$ := atom $\mid \neg\varphi \mid \varphi \vee \varphi \mid \exists x \varphi \mid \exists \pi \varphi \mid \exists \nu \varphi$

Here x, y are node variables, $\pi, \pi'$ are path variables, $\nu, \nu'$ are register assignment variables,
and e is an REM over $\Sigma$ and $\{r_1, \ldots, r_k\}$. □

Intuitively, $\nu = \perp$ holds iff $\nu$ is the empty register assignment, $(x, \pi, y)$ checks that $\pi$
is a path from x to y, and $e(\pi, \nu, \nu')$ checks that $\pi$ can be parsed according to e starting
from register assignment $\nu$ and finishing in register assignment $\nu'$. The quantifier $\exists \nu$ is to
be read “there exists an assignment of data values in the data graph to the registers”.

Let $G = (V, E, \kappa)$ be a data graph over $\Sigma$ and $\varphi$ an RL formula over $\Sigma$ and $\{r_1, \ldots, r_k\}$.
Assume that D is the set of data values that are mentioned in G, i.e., $D = \{\kappa(v) : v \in V\}$.
An assignment $\alpha$ for $\varphi$ over G is a mapping that assigns (i) a node in V to each free node
variable x in $\varphi$, (ii) a path $\rho$ in G to each free path variable $\pi$ in $\varphi$, and (iii) a tuple $\lambda$ in
$(D \cup \{\perp\})^k$ to each register variable $\nu$ that appears free in $\varphi$. That is, for safety reasons we assume
that $\alpha(\nu)$ can only contain data values that appear in the underlying data graph. This
represents no restriction for the expressiveness of the logic.

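We inductively define $(G, \alpha) \models \varphi$, for G a data graph, $\varphi$ an RL formula, and $\alpha$ an
assignment for $\varphi$ over G, as follows (we omit equality atoms and Boolean combinations
since they are standard):

• $(G, \alpha) \models \nu = \perp$ iff $\alpha(\nu) = \perp^k$.

• $(G, \alpha) \models (x, \pi, y)$ iff $\alpha(\pi)$ is a path from $\alpha(x)$ to $\alpha(y)$ in G.

• $(G, \alpha) \models e(\pi, \nu, \nu')$ iff $(u, \alpha(\nu), \alpha(\pi), v, \alpha(\nu')) \in [\![e]\!]_G$, assuming $\alpha(\pi)$ goes from node u to
v.

• $(G, \alpha) \models \exists x \varphi$ iff there is a node $v \in V$ such that $(G, \alpha[x \to v]) \models \varphi$.

• $(G, \alpha) \models \exists \pi \varphi$ iff there is a path $\rho$ in G such that $(G, \alpha[\pi \to \rho]) \models \varphi$.

• $(G, \alpha) \models \exists \nu \varphi$ iff there is a tuple $\lambda$ in $(D \cup \{\perp\})^k$ such that $(G, \alpha[\nu \to \lambda]) \models \varphi$.

Thus, each REM e is expressible in RL using the formula:

$\exists \pi \exists \nu \exists \nu' \, ((x, \pi, y) \wedge \nu = \perp \wedge e(\pi, \nu, \nu'))$.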

Example 5.2. Recall query (Q) from the introduction: Find pairs of nodes x and y in a
graph database, such that there is a node z and a path $\pi$ from x to y in which each node is
connected to z. This query can be expressed in RL over $\Sigma = \{a\}$ and a single register r as
follows:

$\exists \pi \, ((x, \pi, y) \wedge \exists z \forall \nu \, (e_1(\pi, \nu, \nu) \to \exists z' \exists \pi' \, ((z', \pi', z) \wedge e_2(\pi', \nu, \nu))))$,

where $e_1 := a^*[r^=] \cdot a^*$ is the REM that checks whether the node (i.e. data) stored in register
r appears in a path, and $e_2 := \varepsilon[r^=] \cdot a^*$ is the REM that checks if the first node of a path
is the one that is stored in register r.

In fact, this formula defines the pairs of nodes x and y such that there exists a path $\pi$
that goes from x to y and a node z for which the following holds: for every register value $\nu$
(i.e., for every node $\nu$) such that $e_1(\pi, \nu, \nu)$ (i.e. node $\nu$ is in $\pi$), it is the case that there is
a path $\pi'$ from some node z' to z such that $e_2(\pi', \nu, \nu)$ (i.e., $z' = \nu$ and $\pi'$ connects $\nu$ to z).
Notice that this uses the fact that the underlying data model is that of graph databases, in
which each node is uniquely identified by its data value. □

The limitations in expressive power of RL have also been independently recognized by 
Libkin, Martens and Vrgoč [12]. In order to allow for interesting data value comparisons
while retaining reasonable complexity of evaluation, they propose to use query languages 
based on the XML language XPath. These languages are not comparable in terms of 
expressive power to the ones we study here. 

Complexity of evaluation for RL: The evaluation problem for RL, denoted Eval(RL),
is as follows: Given a data graph G, an RL formula $\varphi$, and an assignment $\alpha$ for $\varphi$ over G, is
it the case that $(G, \alpha) \models \varphi$? As before, we denote by Eval(RL, $\varphi$) the evaluation problem
for the fixed RL formula $\varphi$.

We show next that, unlike WL, register logic RL can be evaluated in elementary time,
and, actually, with only one exponential jump over the complexity of evaluation of REMs:

Theorem 5.2. Eval(RL) is Expspace-complete. The lower bound holds even if the input
is restricted to graph databases.

Proof. We start by proving the upper bound, that is, Eval(RL) is in Expspace. The
structure of the proof is quite similar to the one that proves that CRPQ$^\neg$ queries can be
evaluated in Pspace in combined complexity [1]. The difference is that now we have to
accommodate the extra expressive power of RL, which allows us to express properties of register
values and check acceptance of data walks by REMs.

Let $G = (V, E, \kappa)$ be a $\Sigma$-labeled data graph and $\varphi$ an RL formula over $\Sigma$ and $\{r_1, \ldots, r_k\}$.
Let us denote by $D = \{\kappa(v) : v \in V\}$ and define $D_\perp$ to be $D \cup \{\perp\}$. Further, let $\alpha$ be an
assignment for $\varphi$ over G. We define

• $\bar{v} = (v_1, \ldots, v_{k_1})$ as a tuple of nodes in G such that

$\{v_1, \ldots, v_{k_1}\} = \{\alpha(x) : x$ is a free node variable$\}$,

• $\bar{\rho} = (\rho_1, \ldots, \rho_{k_2})$ as a tuple of paths in G such that

$\{\rho_1, \ldots, \rho_{k_2}\} = \{\alpha(\pi) : \pi$ is a free path variable$\}$,

• $\bar{\lambda} = (\lambda_1, \ldots, \lambda_{k_3})$ as a tuple of register values for $\{r_1, \ldots, r_k\}$ over G (i.e., tuples in
$(D \cup \{\perp\})^k$) such that $\{\lambda_1, \ldots, \lambda_{k_3}\} = \{\alpha(\nu) : \nu$ is a free register variable$\}$.

Further, assume that $e_1, \ldots, e_m$ are all the REMs mentioned in $\varphi$. Our goal is to define an
Expspace procedure that checks whether $(G, \alpha) \models \varphi$. In order to do that, we first have to
introduce some new terminology.

Let $\tau$ be a first-order (FO) vocabulary

(Nodes, Paths, Registers, Endpoints, $e_1, \ldots, e_m$, $\perp$),

where (a) Nodes, Paths and Registers are unary relation symbols, (b) Endpoints and $e_i$
($1 \le i \le m$) are ternary relation symbols, and (c) $\perp$ is a constant. We define, from G, an
FO structure $M_G$ over $\tau$ as follows: The domain of $M_G$ is the disjoint union of V, all the
paths that belong to G, and all k-tuples over $D_\perp$. (Notice that each node in V is also a
path in G, but here we consider them to be different objects. That is, each $v \in V$ appears
separately as a node and as a path in the domain of $M_G$.) The constant $\perp$ is interpreted
in $M_G$ as the tuple $\perp^k$. The interpretation of Nodes in $M_G$ contains all those elements of
the domain that are nodes. The interpretation of Paths in $M_G$ contains all those elements
of the domain that are paths. The interpretation of Registers in $M_G$ contains all those
elements of the domain that are k-tuples over $D_\perp$. The interpretation of the ternary relation
Endpoints contains all tuples $(v, \rho, v')$ such that $\rho$ is a path in G from node v to node v'.
Finally, the interpretation of the symbol $e_i$ ($1 \le i \le m$) contains all tuples $(\lambda, \rho, \lambda')$ such
that $\rho$ is a path in G, $\lambda, \lambda'$ are k-tuples over $D_\perp$, and $e_i(\rho, \lambda, \lambda')$.

Let $\varphi_\tau$ be the FO formula over vocabulary $\tau$ obtained from $\varphi$ by simultaneously
replacing (1) each subformula of the form $\exists x \theta$ (for x a node variable) with $\exists x(\mathrm{Nodes}(x) \wedge \theta)$, (2)
each subformula of the form $\exists \pi \theta$ (for $\pi$ a path variable) with $\exists \pi(\mathrm{Paths}(\pi) \wedge \theta)$, (3) each
subformula of the form $\exists \nu \theta$ (for $\nu$ a register variable) with $\exists \nu(\mathrm{Registers}(\nu) \wedge \theta)$, and (4)
each atomic formula of the form $(x, \pi, y)$ with $\mathrm{Endpoints}(x, \pi, y)$.
Clearly, $G, \alpha \models \varphi$ iff $M_G, \alpha \models \varphi_\tau$.

Of course, $M_G$ cannot be effectively constructed from G since the set of paths in G
is potentially infinite, and, thus, $M_G$ is also potentially infinite. However, it is possible to
prove that there exists a finite structure $M'_{G,\bar{\rho}}$ such that $G, \alpha \models \varphi$ iff $M'_{G,\bar{\rho}}, \alpha \models \varphi_\tau$. We
show how to define $M'_{G,\bar{\rho}}$ next.

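Assume that the quantifier rank of $\varphi_\tau$ is $k \ge 0$, where the quantifier rank of an FO
formula $\theta$ is the depth of nested quantification in $\theta$. Let $\mathcal{E} \subseteq \{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$. A path
$\rho$ in G satisfies $\mathcal{E}$ if the following holds: For each triple $(e_i, \lambda, \lambda') \in \{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$,
it is the case that $G, \alpha \models e_i(\rho, \lambda, \lambda')$ iff $(e_i, \lambda, \lambda') \in \mathcal{E}$. (Notice that for each path in G there is
one, and only one, subset $\mathcal{E}$ of $\{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$ that it satisfies.) For each pair $(v, v')$
of nodes in V, and for every $\mathcal{E} \subseteq \{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$, let $c_{\mathcal{E},v,v'} \ge 0$ be the minimum
between $k + |\bar{\rho}|$ and the number of paths in G that go from v to v' and satisfy $\mathcal{E}$. We
arbitrarily pick, for each pair $(v, v')$ of nodes in V and for each $\mathcal{E} \subseteq \{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$,
$c_{\mathcal{E},v,v'}$ distinct paths $\rho^1_{\mathcal{E},v,v'}, \ldots, \rho^{c_{\mathcal{E},v,v'}}_{\mathcal{E},v,v'}$ from v to v' that satisfy $\mathcal{E}$.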

We define the structure $M'_{G,\bar{\rho}}$ as follows: Its domain contains all the nodes of V, each
path $\rho$ that belongs to the tuple $\bar{\rho}$, every path of the form $\rho^i_{\mathcal{E},v,v'}$ where
$\mathcal{E} \subseteq \{e_1, \ldots, e_m\} \times D_\perp^k \times D_\perp^k$, $v, v' \in V$ and $1 \le i \le c_{\mathcal{E},v,v'}$, and every tuple in $D_\perp^k$.
The constant $\perp$ is interpreted in $M'_{G,\bar{\rho}}$ as the tuple $\perp^k$. The interpretation of Nodes in $M'_{G,\bar{\rho}}$ contains all
nodes in the domain. The interpretation of Paths in $M'_{G,\bar{\rho}}$ contains all those elements of
the domain that are paths. The interpretation of Registers in $M'_{G,\bar{\rho}}$ contains all those
elements of the domain that are k-tuples over $D_\perp$. The interpretation of the ternary relation
Endpoints contains all tuples of the form $(v, \rho, v')$, where $v, v' \in V$ and $\rho$ is a path in the
domain that goes from v to v' in G. Finally, the interpretation of $e_i$ ($1 \le i \le m$) in $M'_{G,\bar{\rho}}$
contains all tuples $(\lambda, \rho, \lambda')$ such that $\rho$ is a path in the domain, $\lambda, \lambda'$ are k-tuples over $D_\perp$,
and $e_i(\rho, \lambda, \lambda')$.

By using a standard Ehrenfeucht-Fraïssé argument it is possible to prove the following:

Claim 5.2.1. The structures $(\mathcal{M}_G,\bar v,\bar\rho,\bar\lambda)$ and $(\mathcal{M}'_{G,\bar\rho},\bar v,\bar\rho,\bar\lambda)$ are indistinguishable by FO
sentences of quantifier rank $\leq k$.

Proof. We show that the duplicator has a winning strategy in the $k$-round Ehrenfeucht-Fraïssé
game played on $(\mathcal{M}_G,\bar v,\bar\rho,\bar\lambda)$ and $(\mathcal{M}'_{G,\bar\rho},\bar v,\bar\rho,\bar\lambda)$. The duplicator’s response to a
spoiler move in round $i \leq k$ is (inductively) defined as follows (we assume without loss of
generality that the spoiler never repeats moves, i.e. in no round does the spoiler choose an
element that has already been chosen by either player in previous rounds):

• If the spoiler’s move in round $i$ is a node in either of the two structures, then the duplicator
responds by mimicking the spoiler’s move on the other structure;

• if the spoiler’s move in round $i$ is a $k$-tuple over $D \cup \{\bot\}$ in either of the two structures,
then the duplicator responds by mimicking the spoiler’s move on the other structure;

• if the spoiler’s move in round $i$ is a path $\rho$ in $\bar\rho$ in either of the two structures, then again
the duplicator responds by mimicking the spoiler’s move on the other structure;

• if the spoiler plays a path $\rho$ from node $v$ to $v'$, in either of the two structures, such that
$\rho$ satisfies $\mathcal{E} \subseteq \{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$ and $\rho$ is not a path in $\bar\rho$, then the duplicator
responds with any path from $v$ to $v'$ in the other structure that (1) satisfies $\mathcal{E}$, (2) does
not belong to $\bar\rho$, and (3) has not been previously chosen in the game. Notice that it is
always possible for the duplicator to choose such a path, since for each pair of nodes
$v,v' \in V$ and for each $\mathcal{E} \subseteq \{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$, the number of paths from $v$ to $v'$
that satisfy $\mathcal{E}$ and do not belong to $\bar\rho$ is the same in both structures up to $k$.

It is easy to see that the duplicator’s response defined in this way always preserves a partial
isomorphism between the two structures. This implies that the duplicator has a winning
strategy in the $k$-round Ehrenfeucht-Fraïssé game played on $(\mathcal{M}_G,\bar v,\bar\rho,\bar\lambda)$ and $(\mathcal{M}'_{G,\bar\rho},\bar v,\bar\rho,\bar\lambda)$,
and, thus, by well-known results, that the structures are indistinguishable by FO sentences
of quantifier rank $\leq k$. □

The previous claim shows that $(G,\sigma) \models \phi$ iff $(\mathcal{M}'_{G,\bar\rho},\sigma) \models \phi_T$. Thus, a straightforward
approach to check whether $(G,\sigma) \models \phi$ would be to construct $\mathcal{M}'_{G,\bar\rho}$ and then evaluate $\phi_T$
over it. The problem with this approach is that $\mathcal{M}'_{G,\bar\rho}$ could be of double exponential size
(because there is a double exponential number of different subsets $\mathcal{E}$ of $\{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$),
and, thus, impossible to construct in exponential space. It will be necessary to follow
a different approach.

Assume that $\phi_T$ is given in prenex normal form, i.e. $\phi_T$ is of the form

$Q_1 y_1 \cdots Q_m y_m\, \psi(\bar x,\bar\pi,\bar\nu,y_1,\dots,y_m)$,

where each $Q_i$ is either $\exists$ or $\forall$, each $y_i$ is a node, path or register variable, and $\psi$ is
quantifier-free (if $\phi_T$ is not in prenex normal form, we can convert it in polynomial time
into an equivalent formula in prenex normal form). We follow a usual argument to evaluate
FO formulas on structures. The main problem with this is that some of the elements in
$\mathcal{M}'_{G,\bar\rho}$ are paths and register values, and have to be treated as such. Therefore, we define a
way of encoding paths (in exponential space) and register values (in polynomial space).

Clearly, each register value can be codified with a tuple of length $k \cdot \log_2(|V|)$. In order
to denote that this tuple is the address of a register value (and not, say, of a path), we add
an extra bit at the beginning of the tuple which is labeled with a new symbol r. Codification
of paths requires a bit of extra work. Each path $\rho$ is encoded with an address, that is, a
string that satisfies the following:

• It starts with a new symbol p, that states that this is the address of a path;

• the address continues with the encodings of the two endpoints $v$ and $v'$ of the path
(separated with some delimiter); this part of the address uses $O(\log_2 |V|)$ space;

• then the address encodes the subset $\mathcal{E}$ of $\{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$ that $\rho$ satisfies; this
encoding can be easily expressed with a string of length $m \times |V|^k \times |V|^k$ over alphabet
$\{0,1\}$ that flags with a 1 those elements of $\{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$ that belong to $\mathcal{E}$;

• finally, the address contains an encoding of the integer $i \leq k + |\bar\rho|$ such that $\rho = \rho^i_{\mathcal{E},v,v'}$;
this encoding uses $O(\log_2(|\phi| + |\bar\rho|))$ space.

Clearly, the address of a path defined in this way can be specified using at most exponential
space.
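The address scheme above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not fixed by the text: nodes and data values are taken to be small non-negative integers, `|` is used as the delimiter, and the function names (`register_address`, `path_address`, `kind`) are hypothetical.

```python
from math import ceil, log2


def bits(value: int, width: int) -> str:
    """Fixed-width binary representation of a non-negative integer."""
    return format(value, "b").zfill(width)


def block_width(num_nodes: int) -> int:
    # Assumption: one extra code point is reserved for the empty register value.
    return max(1, ceil(log2(num_nodes + 1)))


def register_address(values, num_nodes: int) -> str:
    """Tag 'r' followed by one fixed-width block per register."""
    width = block_width(num_nodes)
    return "r" + "".join(bits(x, width) for x in values)


def path_address(v: int, v_prime: int, flags, index: int, num_nodes: int) -> str:
    """Tag 'p', the two endpoints, the 0/1 flags describing the satisfied
    subset, and the index i of the chosen representative path."""
    width = block_width(num_nodes)
    flag_string = "".join("1" if f else "0" for f in flags)
    return "|".join(["p" + bits(v, width), bits(v_prime, width),
                     flag_string, format(index, "b")])


def kind(address: str) -> str:
    """Classify an address by its first symbol, as the atomic checks do."""
    if address.startswith("p"):
        return "path"
    if address.startswith("r"):
        return "register"
    return "node"
```

In the actual proof the flag string has length $m \times |V|^k \times |V|^k$, which is what makes path addresses exponentially long; the sketch simply takes the flags as an explicit list.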

We show next how the problem of checking whether $(\bar v,\bar\rho,\bar\lambda)$ belongs to the evaluation of
$\phi_T$ over $\mathcal{M}'_{G,\bar\rho}$ can be solved in exponential time by an alternating Turing machine. This will
finish the proof of the theorem, since the class of problems that can be solved in exponential
time by alternating Turing machines coincides with the class of problems that can be solved
in Expspace.

The alternating machine proceeds as follows. It first replaces in $\phi_T$ each node variable $x$
in $\bar x$ with the encoding of the corresponding node $v$ of $\bar v$, each path variable $\pi$ in $\bar\pi$ with the
encoding (address) of the corresponding path $\rho$ in $\bar\rho$, and each register variable $\nu$ in $\bar\nu$ with
the encoding of the corresponding tuple $\lambda$ in $\bar\lambda$. Then the machine reads the formula $\phi_T$ from
left to right. Each time it encounters an existential quantifier $\exists y_i$ it enters an existential
state, and each time it encounters a universal quantifier $\forall y_i$ it enters a universal state. In
each case, the machine “guesses” the interpretation of $y_i$ as the encoding of a node, a path
or a register value $c(y_i)$ in the domain. (Since encodings of paths are of exponential size,
this alternating machine requires at least exponential time to work.) Finally, the machine
verifies that $\psi(\bar v,\bar\rho,\bar\lambda,c(y_1),\dots,c(y_m))$ holds, and if that is the case it accepts. We show
next that the latter can be done in exponential time. Notice that this implies that the
whole process can be performed in exponential time.
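The alternation pattern of this machine can be mimicked deterministically. The following is a minimal sketch, assuming a small explicit domain and a quantifier-free matrix given as a Python predicate; the name `evaluate` and the string tags `"exists"`/`"forall"` are illustrative choices. Existential states correspond to `any`, universal states to `all`.

```python
def evaluate(prefix, matrix, domain, assignment=()):
    """Evaluate a prenex formula Q1 y1 ... Qm ym . matrix(y1, ..., ym).

    prefix     -- sequence of "exists" / "forall" tags, read left to right
    matrix     -- predicate taking one argument per quantified variable
    domain     -- finite collection of candidate values for every variable
    assignment -- values chosen so far (the machine's guesses)
    """
    if not prefix:
        return matrix(*assignment)
    quantifier, rest = prefix[0], prefix[1:]
    if quantifier not in ("exists", "forall"):
        raise ValueError(f"unknown quantifier: {quantifier!r}")
    branches = (evaluate(rest, matrix, domain, assignment + (c,)) for c in domain)
    # An existential state accepts if some branch accepts,
    # a universal state accepts if every branch accepts.
    return any(branches) if quantifier == "exists" else all(branches)
```

This recursion explores the whole computation tree, so it only illustrates the acceptance condition; the complexity bound in the proof comes from the alternating machine itself, not from such a simulation.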

We start with the case of the atomic formulas in $\psi$. In order to check whether the
element assigned to a variable belongs to the interpretation of Nodes in $\mathcal{M}'_{G,\bar\rho}$, we only have
to check that the encoding of this element does not start with a p or an r. In order to
check whether the element belongs to the interpretation of Paths (resp., Registers), it is
sufficient to check that its encoding starts with a p (resp., r). In order to check whether
the elements $a, b, c$ assigned to variables $x, \pi, y$, respectively, are such that $(a,b,c)$ belongs
to the interpretation of Endpoints, we only have to check that $b$ is the encoding of a path,
$a$ and $c$ are encodings of nodes, and that $b$ is a path from $a$ to $c$. Finally, in order to
check whether the elements $(a,b,c)$ assigned to the variables belong to the interpretation of
$e_i$ ($1 \leq i \leq m$), we only have to check that $a, c$ are register values (i.e. their encodings
start with symbol r), that $b$ encodes a path $\rho$ (i.e. its encoding starts with p), and that
the bit that corresponds to tuple $(e_i,a,c)$ in the part of the address $b$ that encodes the set
$\mathcal{E} \subseteq \{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$ that $\rho$ satisfies is set to 1.

Thus, the value of the atomic formulas involved in $\psi(\bar v,\bar\rho,\bar\lambda,c(y_1),\dots,c(y_m))$ can be
computed in polynomial time (in the size of $\psi(\bar v,\bar\rho,\bar\lambda,c(y_1),\dots,c(y_m))$). But since $\psi$ is a
polysize Boolean combination of atomic formulas, the value of $\psi(\bar v,\bar\rho,\bar\lambda,c(y_1),\dots,c(y_m))$
can be computed in polynomial time from the values of the atomic formulas. We conclude
that computing the value of $\psi(\bar v,\bar\rho,\bar\lambda,c(y_1),\dots,c(y_m))$ can be done in polynomial time.

There is, however, one small issue that requires explanation in order for the previous
procedure to work properly. Assume that the procedure “guesses” the interpretation of
a variable $y_i$ in $\phi_T$ to be the encoding of a path in $G$ from $v$ to $v'$ that satisfies
$\mathcal{E} \subseteq \{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$. Then it is necessary to check that, if the encoding implies that
this path is $\rho^i_{\mathcal{E},v,v'}$, then $i \leq c_{\mathcal{E},v,v'}$. In order to do so, the procedure needs to check, in a
subroutine, whether there exist $i$ different paths from $v$ to $v'$ that satisfy $\mathcal{E}$. The next claim
shows that this can be done in exponential space, which finishes the proof of the theorem.

Claim 5.2.2. For each pair $v,v' \in V$, $\mathcal{E} \subseteq \{e_1,\dots,e_m\} \times D_\bot^k \times D_\bot^k$, and $i \leq k + |\bar\rho|$, one
can check in Expspace whether there are $i$ distinct paths in $G$ from $v$ to $v'$ that satisfy $\mathcal{E}$.

Proof. Let $\#$ be a symbol not in $\Sigma$ and denote by $\Sigma_\#$ the alphabet $\Sigma \cup \{\#\}$. Let $\mathcal{A}_{v,v'}$ be the
automaton over alphabet $\{v\} \cup (\Sigma_\# \times V)$ defined as follows. The set of states is the disjoint
union of $V$ with a new state $s$. The initial state of $\mathcal{A}_{v,v'}$ is $s$ and the final state is $v'$. Further,
the transition relation of $\mathcal{A}_{v,v'}$ is defined as follows: (1) For every edge $(v_1,a,v_2) \in E$ there is a
transition in $\mathcal{A}_{v,v'}$ from $v_1$ to $v_2$ labeled $(a,v_2)$, (2) for every node $v_1 \in V$ there is a transition
from $v_1$ to $v_1$ in $\mathcal{A}_{v,v'}$ labeled $(\#,v_1)$, and (3) there is a transition in $\mathcal{A}_{v,v'}$ from $s$ to $v$ labeled
$v$. Intuitively, $\mathcal{A}_{v,v'}$ accepts exactly those strings of the form $v(a_1,v_1)(a_2,v_2)\cdots(a,v')$ such
that $v a_1 v_1 a_2 v_2 \cdots a v'$ is a path in $G$ from $v$ to $v'$, when we allow paths to loop arbitrarily
many times on $\#$-labeled nodes.

Let $\mathcal{A}^i_{v,v'}$ be the automaton over alphabet $\{v^i\} \cup (\Sigma_\# \times V)^i$ defined as follows: The set
of states is $V^i \cup \{s^i\}$, the initial state is $s^i$ and the final state is $(v')^i$. There is a transition in
$\mathcal{A}^i_{v,v'}$ from $\bar u = (u_1,\dots,u_i)$ to $\bar w = (w_1,\dots,w_i)$ labeled $\bar t = (t_1,\dots,t_i)$ iff there is a transition
labeled $t_\ell$ from $u_\ell$ to $w_\ell$ in $\mathcal{A}_{v,v'}$, for each $1 \leq \ell \leq i$. (Notice that $\mathcal{A}^i_{v,v'}$ is not exactly the $i$-th
product of $\mathcal{A}_{v,v'}$ with itself, as it does not contain all states in such a product.) Clearly,
$\mathcal{A}^i_{v,v'}$ is of exponential size but the size of each one of its states is polynomial. Furthermore,
it is decidable in polynomial time whether there exists a transition labeled $\bar t$ from state $\bar u$
to $\bar w$ in $\mathcal{A}^i_{v,v'}$.
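The single-copy automaton $\mathcal{A}_{v,v'}$ defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes are hashable values, edges are triples `(v1, a, v2)`, the padding symbol `"#"` is assumed not to be an edge label, and the names `START`, `build_transitions` and `accepts` are hypothetical.

```python
PAD = "#"            # padding symbol, assumed not to occur as an edge label
START = ("start",)   # the fresh initial state s, assumed not to be a node


def build_transitions(nodes, edges, source):
    """Transition table following rules (1)-(3) of the construction."""
    delta = {(START, source): source}      # (3) s --source--> source
    for v1, a, v2 in edges:
        delta[(v1, (a, v2))] = v2          # (1) follow an edge of the graph
    for v in nodes:
        delta[(v, (PAD, v))] = v           # (2) padding loop on every node
    return delta


def accepts(nodes, edges, source, target, word):
    """Run the automaton on word; the only final state is target."""
    delta = build_transitions(nodes, edges, source)
    state = START
    for letter in word:
        if (state, letter) not in delta:
            return False
        state = delta[(state, letter)]
    return state == target
```

Because every letter names its target node, the table is deterministic, so a dictionary suffices. The product $\mathcal{A}^i_{v,v'}$ would run $i$ such copies in lockstep on tuples of letters.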

Define now an automaton $\mathcal{A}'^{\,i}_{v,v'}$ that is the restriction of $\mathcal{A}^i_{v,v'}$ to those strings

$v^i (w_1^1,\dots,w_1^i) \cdots (w_p^1,\dots,w_p^i)$

over alphabet $\{v^i\} \cup (\Sigma_\# \times V)^i$ that satisfy the following:

• For each $1 \leq \ell \leq i$, if for some $1 \leq j \leq p$ it is the case that $w_j^\ell = (\#,v')$, for some $v' \in V$,
then for each $j \leq j' \leq p$ it is the case that $w_{j'}^\ell = (\#,v')$.

• For each $1 \leq \ell, t \leq i$, if $\ell \neq t$ then the strings $v w_1^\ell \cdots w_p^\ell$ and $v w_1^t \cdots w_p^t$ over alphabet
$\{v\} \cup (\Sigma_\# \times V)$ are different.

The first condition says that each projection of a string accepted by $\mathcal{A}'^{\,i}_{v,v'}$ represents a path
in $G$ from $v$ to $v'$ that loops only on $v'$ and only at the end of the path. The second
condition ensures that any two distinct projections of a string accepted by $\mathcal{A}'^{\,i}_{v,v'}$ represent
different strings.

It is not hard to prove that the language accepted by $\mathcal{A}'^{\,i}_{v,v'}$ is nonempty iff there exist $i$
distinct paths in $G$ from $v$ to $v'$. Further, it is also not hard to see that $\mathcal{A}'^{\,i}_{v,v'}$ is of exponential
size but the size of each one of its states is polynomial; and it is decidable in polynomial
time whether there exists a transition labeled $\bar t$ from state $q$ to state $q'$ in $\mathcal{A}'^{\,i}_{v,v'}$.

Using techniques in m, it is also possible to construct in exponential time an NFA
$\mathcal{A}_{v,v',e_i,\lambda,\lambda'}$ ($i \in \{1,\dots,m\}$, $\lambda,\lambda' \in D_\bot^k$) over alphabet $\{v\} \cup (\Sigma_\# \times V)$, that accepts precisely
the strings $w$ accepted by $\mathcal{A}_{v,v'}$ such that the path $\rho$ from $v$ to $v'$ in $G$ that is represented by
$w$ satisfies $e_i(\rho,\lambda,\lambda')$. (The main idea is to construct $\mathcal{A}_{v,v',e_i,\lambda,\lambda'}$ in such a way that, at each
position while reading $w$, it keeps in its state the $k$-tuple of data values that is stored in the
registers of $e_i$.) The set of states of $\mathcal{A}_{v,v',e_i,\lambda,\lambda'}$ is of exponential size, but each particular
state can be represented using polynomial space. Further, deciding if there is a transition
between two states of $\mathcal{A}_{v,v',e_i,\lambda,\lambda'}$ can be done in polynomial time. This means that for each
$\mathcal{A}_{v,v',e_i,\lambda,\lambda'}$, its complement can be constructed in double exponential time, with each state
using only exponential space.

It is not hard to see, then, that one can construct in double exponential time an
automaton $\mathcal{A}^i_{\mathcal{E},v,v'}$ over alphabet $\{v^i\} \cup (\Sigma_\# \times V)^i$ that does the following: It starts from
$\mathcal{A}'^{\,i}_{v,v'}$ and restricts acceptance to strings $v^i (w_1^1,\dots,w_1^i) \cdots (w_p^1,\dots,w_p^i)$ over alphabet
$\{v^i\} \cup (\Sigma_\# \times V)^i$ such that:

• For each $1 \leq j \leq i$ and tuple $(e_\ell,\lambda,\lambda') \in \mathcal{E}$, it is the case that $v w_1^j \cdots w_p^j$ is accepted by
$\mathcal{A}_{v,v',e_\ell,\lambda,\lambda'}$.

• For each $1 \leq j \leq i$ and tuple $(e_\ell,\lambda,\lambda') \notin \mathcal{E}$, it is the case that $v w_1^j \cdots w_p^j$ is accepted by
the complement of $\mathcal{A}_{v,v',e_\ell,\lambda,\lambda'}$.

Further, each state in $\mathcal{A}^i_{\mathcal{E},v,v'}$ can be represented using exponential space and checking
whether there is a transition between two given states of $\mathcal{A}^i_{\mathcal{E},v,v'}$ can be done in polynomial
time.

It is clear that there exist $i$ distinct paths in $G$ from $v$ to $v'$ that satisfy $\mathcal{E}$ if and only if
$\mathcal{A}^i_{\mathcal{E},v,v'}$ accepts at least one string. But we can check $\mathcal{A}^i_{\mathcal{E},v,v'}$ for nonemptiness in Expspace
using a standard “on-the-fly” argument. This finishes the proof of the claim. □
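The “on-the-fly” idea is that the automaton is never written down: states are produced on demand from a successor function. A minimal sketch, assuming hashable states and hypothetical callbacks `successors` and `is_final`, is given below. Note that this breadth-first search stores every visited state, so it illustrates only the implicit representation; the space bound in the claim relies instead on guessing a run one state at a time and keeping only the current state and a step counter.

```python
from collections import deque


def nonempty(initial, successors, is_final):
    """Decide whether some final state is reachable from initial.

    initial    -- the start state
    successors -- function mapping a state to an iterable of next states
    is_final   -- predicate telling whether a state is accepting
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_final(state):
            return True
        for nxt in successors(state):
            if nxt not in seen:      # states are generated only when needed
                seen.add(nxt)
                queue.append(nxt)
    return False
```

For example, a pair of counters advancing in lockstep modulo 2 and modulo 4 can be explored without listing the product state space in advance.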

This also finishes the proof that Eval(RL) is in Expspace. Now we show that
Eval(RL) is Expspace-hard.

For all constants $f_0$, we provide a reduction from the class of problems solvable by a
Turing machine using a tape of size $2^{f_0 n}$ given an input word of size $n$. There are a Turing
machine $M$ and a constant $f_0$ such that the following problem is Expspace-hard: given a
word $w$ of size $n$, is there an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells? We prove
that for all words $w$ of size $n$, there are a formula $\phi_w \in$ RL
and a graph $G_w$ such that

$G_w \models \phi_w$ iff there is an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells.

Let $(\Sigma,Q,\delta,q_0,q_f)$ be the Turing machine $M$, where $\Sigma$ is the alphabet consisting of the
input alphabet and the blank symbol $B$, $q_0$ is the initial state, $q_f$ is the final state and
$\delta : Q \times \Sigma \to (Q \times \Sigma \times \{L,R\})$ is the transition map, where $L$ stands for “left” and $R$ stands
for “right”.

The formula $\phi_w$ that we associate with the machine $M$ and a word $w$ is a formula of
the form

$\exists\pi\,\psi_w(\pi)$,

where $\psi_w$ is a formula that does not contain any quantification over path variables. The
formula $\psi_w(\pi)$ expresses that the path $\pi$ in the graph $G_w$ encodes an accepting run of $M$
over the word $w$.

As in the proof of Theorem sn we encode a configuration $C$ of a run of $M$ in the
following way. Suppose that the content of the tape is the word $w' = w'_1 \dots w'_{2^{f_0 n}}$, the head
is scanning the cell number $i_0$ and the machine is in state $q$. We encode the configuration $C$
by the word $e_C = d_1^C \,\&\, d_2^C \,\&\, \cdots \,\&\, d_{2^{f_0 n}}^C$, where $\&$ plays the role of a delimiter and each $d_i^C$ encodes
the information in cell number $i$. More precisely, if $\Delta$ is the alphabet $(Q \cup \{\$\}) \times \Sigma$, we
define $d_i^C$ as the word

$c(i)\,(q'_i, w'_i)$,

where $c(i)$ and $q'_i$ are defined as follows. The word $c(i)$ is the binary representation of the
number $i$. The letter $w'_i$ is the content of the cell $i$. The letter $q'_i$ is equal to $\$$ if the head
is not scanning the cell number $i$; otherwise, $q'_i$ is equal to the state $q$. That is, $q'_{i_0} = q$ and
for all $i \neq i_0$, $q'_i = \$$. We encode a run $C_0 C_1 \dots$ as the word $e_{C_0} \# e_{C_1} \# \cdots$. We define the
formula $\psi_w(\pi)$ and the graph $G_w$ in such a way that a path $\pi$ satisfies $\psi_w$ iff the label of $\pi$
is the encoding of an accepting run of $M$ over $w$.
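To make the encoding concrete, here is a minimal Python sketch. Several details are assumptions chosen for illustration and not fixed by the text: cells are numbered from 1, each bit of $c(i)$ is a separate letter, the width of $c(i)$ is passed explicitly as `width`, and the function names are hypothetical.

```python
def cell_word(i, width, mark, symbol):
    """Letters of c(i) followed by the pair (mark, symbol)."""
    return list(format(i, "b").zfill(width)) + [(mark, symbol)]


def encode_configuration(tape, head, state, width):
    """Encode one configuration as cell words separated by '&'.

    tape  -- sequence of tape symbols, cell 1 first
    head  -- number of the scanned cell (1-based)
    state -- current state of the machine
    """
    word = []
    for i, symbol in enumerate(tape, start=1):
        if i > 1:
            word.append("&")                  # delimiter between cells
        mark = state if i == head else "$"    # '$' marks unscanned cells
        word.extend(cell_word(i, width, mark, symbol))
    return word


def encode_run(configurations, width):
    """Encode a run as configuration encodings separated by '#'."""
    word = []
    for n, (tape, head, state) in enumerate(configurations):
        if n > 0:
            word.append("#")                  # delimiter between configurations
        word.extend(encode_configuration(tape, head, state, width))
    return word
```

For instance, a two-cell tape `"ab"` scanned at cell 1 in state `"q0"` yields the letters `0 1 (q0,a) & 1 0 ($,b)` when `width` is 2.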

The graph $G_w$ is (almost) the same graph as the graph in the proof of Theorem 4.1
in the case where $k = 1$. That is, $G_w$ is obtained by “linking” two graphs $I_w$ and $H$. If
$C_0 C_1 \dots$ is an accepting run with associated path $\pi$, the trace of $\pi$ on $I_w$ will correspond
to the encoding of the initial configuration $C_0$, while the trace of $\pi$ on $H$ is the encoding of
the run $C_1 C_2 \dots$. The graph $H$ is given by


[Figure: the graph $H$.]



Recall that the set of letters $d_i$ is defined as $(Q \cup \{\$\}) \times \Sigma$. Consider a simple path
$\pi'$ starting from the node with data value 1 and ending in the node with outgoing edges
with labels $\&$ and $\#$. Its label is of the form $c(i)(q',a)$; that is, it is the encoding of the
information in a cell with number $i$. Hence, the label of a path $\pi$ starting and ending in the
node with data value 1 satisfies $c(i_1)d_1(\& \vee \#)c(i_2)d_2(\& \vee \#) \dots c(i_k)d_k(\& \vee \#)$, where each
$d_j$ is the encoding of the information of a cell and each $c(i_j)$ is the encoding of the number
$i_j$. We define the formula $\psi_w$ in such a way that if $\psi_w(\pi)$ holds, the succession of encodings
of cells describes a run of $M$.

Next we define the graph $I_w$ where we encode the initial configuration. Suppose that
$w = w_1 \dots w_n$. For all $1 \leq i \leq n$, we introduce a graph $K_i$ describing the cell number $i$ in
the initial configuration. If $b_1 \dots b_{f_0 n}$ is the binary encoding of the number $i$, the graph $K_i$ is
given by

[Figure: the graph $K_i$, whose last edge is labeled $(q'_i, w_i)$.]

where $q'_1 = q_0$ and $q'_i = \$$ if $i \neq 1$. The label of the longest path of $K_i$ is exactly the encoding
of cell number $i$ in the initial configuration. Next we define the graph $K$ which allows us
to encode the cells with number $> n$ in the initial configuration. The graph $K$ is given by


[Figure: the graph $K$.]

The label of a simple path from the node with data value 1 to the node with outgoing edge
with label $\&$ is of the form $c(i)(\$,B)$; that is, it is the encoding of an unscanned cell with
a blank symbol. In particular, it is the encoding of all the cells with number $> n$ in the
initial configuration. The graph $I_w$ is obtained by linking together the graphs $K_1,\dots,K_n,K$
in the following way

[Figure: the graph $I_w$.]

The arrow with a label $s$ is an arrow pointing to the node of $K_1$ with data value 1. Each
arrow with label # between two graphs is from the “right-most” node of the first graph to
the “left-most” node of the second graph. Finally, the graph $G_w$ is defined by



where the edge with label $\#$ is an edge from the “right-most” node of $K$ to the node with
data value 1 in $H$. We define now a formula $\psi_w$ such that

$G_w \models \psi_w(\pi)$ iff
$l(\pi)$ is the encoding of an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells,

where $l(\pi)$ is the label of $\pi$. In fact it is easier to define a formula $\chi(\pi)$ such that

$G_w \models \chi(\pi)$ iff
$l(\pi)$ is not the encoding of an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells. (5.1)

Suppose that $\pi$ is a path through $G_w$. The label of $\pi$ is not the encoding of an accepting
run of $M$ over $w$ using at most $2^{f_0 n}$ cells iff at least one of the following conditions holds.

(i) The first letter of $l(\pi)$ is not the initial letter $s$ or the run never reaches a final state, i.e.
there is no pair of the form $(q_f,a)$, for some $a$, occurring in $l(\pi)$.

(ii) The symbol $\#$ is not at the “right place”:

— either after we reach the symbol $\#$ (i.e. we are going to enter the encoding of a
new configuration), the label contains the binary encoding of a number $\neq 1$,

— or after the binary encoding of the number $2^{f_0 n}$ (that is, after encoding the
information of the last cell), the symbol $\#$ does not occur in the label of $\pi$. Since $\#$ is
used as a delimiter between encodings of configurations, this means that although
we finished encoding the last cell of a configuration, we do not move to a new
configuration.

(iii) There is a substring $c(i)\,d\,\&\,c(j)\,d'$ (where $i < 2^{f_0 n}$) such that $j$ is not the successor of
$i$. That is, after encoding the information of cell number $i$, we do not encode the
information of cell number $i + 1$.

(iv) Finally, there is a string $e_C \# e_{C'}$ of $l(\pi)$ such that $C$ and $C'$ are not successive
configurations.

Expressing cases (i) and (ii) is fairly easy. We concentrate on (iii) and (iv). We start by
showing how to express case (iii). Suppose that $c(i)$ and $c(j)$ are two successive binary
encodings occurring in the label of $\pi$. Suppose $c(i) = b_1 \dots b_{f_0 n}$ and $c(j) = b'_1 \dots b'_{f_0 n}$. Then
$j$ is not the successor of $i$ iff one of the following holds.

(a) For some $k$, $b_k \cdots b_{f_0 n}$ is equal to $1 \dots 1$ and $b'_k$ is not equal to 0.

(b) For some $k$, $b_k \cdots b_{f_0 n}$ is equal to $01 \dots 1$ and $b'_k$ is not equal to 1.

(c) For some $k$, $b_k = 1$, 0 occurs in $b_{k+1} \cdots b_{f_0 n}$ and $b'_k$ is not equal to 1.

(d) For some $k$, $b_k = 0$, 0 occurs in $b_{k+1} \cdots b_{f_0 n}$ and $b'_k$ is not equal to 0.

We show how we can express (a); the other cases are similar. Case (a) is expressed by the
following REM

$(A \setminus \Delta_f)^* \cdot {\downarrow}r_1.\,1^*\,\Delta\,\&\,\{0,1\}^*[r_1^=]\,1\,A^*$,

where $\Delta_f$ is the set $\{q_f\} \times \Sigma$ and $A$ is the alphabet of the graph $G_w$. In the register $r_1$, we
store the number $k$ such that $b_k \cdots b_{f_0 n} = 1 \dots 1$ (that is, we only go through edges with
label 1 until we reach an edge with label in $\Delta$). When we reach again the node with data
value $k$, the label of the outgoing edge is 1, expressing that $b'_k = 1$.
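Conditions (a)-(d) above can be checked mechanically against ordinary arithmetic. The sketch below is illustrative: bit strings are Python strings with the most significant bit first, the function name `not_successor` is hypothetical, and the pattern $01 \dots 1$ in (b) is assumed to allow zero trailing 1s. Under these conditions the successor of $1 \dots 1$ is $0 \dots 0$, so the comparison is made modulo $2^n$.

```python
def not_successor(b: str, b2: str) -> bool:
    """Return True iff one of conditions (a)-(d) holds for c(i) = b, c(j) = b2."""
    assert len(b) == len(b2)
    for k in range(len(b)):
        suffix = b[k:]       # b_k ... b_n
        rest = b[k + 1:]     # b_{k+1} ... b_n
        # (a) the suffix is all 1s, so position k must flip to 0
        if set(suffix) == {"1"} and b2[k] != "0":
            return True
        # (b) the suffix is 0 followed only by 1s, so position k must flip to 1
        if b[k] == "0" and "0" not in rest and b2[k] != "1":
            return True
        # (c), (d) a 0 occurs to the right, so position k must stay unchanged
        if "0" in rest and b2[k] != b[k]:
            return True
    return False
```

Each position falls under exactly one of the three tests, which is why the four conditions together pin down every bit of the successor.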

Finally we look at case (iv), i.e. how to express that there is a string $e_C \# e_{C'}$ of $l(\pi)$ such
that $C$ and $C'$ are not successive configurations. This might happen for several reasons: (A)
either we did not modify properly the content of a cell or move properly the head or (B)
we modified the content that was supposed to remain constant. We only treat case (A), as
the other case can be handled in a similar way (note that the proof of Theorem 5.3 is very
similar to this proof and there, we will treat case (B)). In case (A), we also only consider
the case of a transition moving the head to the right, the other case being symmetric.

So suppose that $\delta(q,a) = (q',b,R)$. As in the proof of Theorem 4.1 we use a slightly
different definition of a run of a Turing machine (which is equivalent to the usual definition,
but helps us to keep our formulas simpler). We assume that if $\delta(q,a) = (q',b,R)$ and the
machine scans a cell $c$ with content $a$, then in the next state, the machine scans the successor
$c'$ of $c$, the content of $c'$ is $b$, while the content of $c$ is $a$.

Let $A^*_\#$ be the set of words over $A$ that contain at most one occurrence of $\#$. We
consider the following REM

$A^*\,\{0,1\}{\downarrow}r_1 \cdots \{0,1\}{\downarrow}r_{f_0 n}.\,(q,a)\,A^*_\#\,\{0,1\}[r_1^=] \cdots \{0,1\}[r_{f_0 n}^=]\,\Delta\,\&\,\{0,1\}^*\,(\Delta \setminus \{(q',b)\})\,A^*$

We store in the registers $r_1,\dots,r_{f_0 n}$ the binary encoding of a number $i$. That number is
the number of a scanned cell with content $a$ and the current state is $q$. After the next
occurrence of $\#$ (after reading a word in $A^*_\#$), we enter a new configuration. In that new
configuration, we reach the cell number $i$ when we read a sequence matching the contents of
the registers. The encoding of the next cell (after reading a sequence in $\Delta\&$) must consist
of the binary encoding of a number followed by a symbol in $\Delta$ that is not $(q',b)$.

If $\Delta_R$ is the set $\{(q,a) : \delta(q,a) = (q',b,R)$ for some $q',b\}$ and $\lambda$ is a register of size $f_0 n$,
we define the formula

taking care of case (iv)(A). □

The increase in expressiveness of RL over REM has an important cost in data complexity,
which becomes intractable:

Theorem 5.3. Eval(RL) is in Pspace in data complexity. Furthermore, there is a finite
alphabet $\Sigma$ and an RL formula $\phi$ over $\Sigma$ and a single register $r$, such that Eval(RL,$\phi$) is
Pspace-hard. In addition, the latter holds even if the input is restricted to graph databases.

Proof. The upper bound follows as a corollary to the proof of the upper bound in Theorem
5.2. In fact, it is clear that the whole process can be carried out in Pspace if we assume a fixed
RL query (in fact, to obtain a Pspace upper bound we do not need more than to fix the
number of registers used in the query).

For the lower bound, we define a formula $\phi$ in RL such that for all constants $f_0$, there
is a reduction from the class of problems solvable by a Turing machine using a tape of size
$f_0 n$ given an input word of size $n$, to the evaluation problem of $\phi$. More precisely, there are
a Turing machine $M$ and a constant $f_0$ such that the following problem is Pspace-hard:
given a word $w$ of size $n$, is there an accepting run of $M$ over $w$ using at most $f_0 n$ cells?

We prove that the formula $\phi$ is such that for all words $w$ of size $n$, there is a graph $G_w$
such that

$G_w \models \phi$ iff there is an accepting run of $M$ over $w$ using at most $f_0 n$ cells.

Let $(\Sigma,Q,\delta,q_0,q_f)$ be the Turing machine $M$, where $\Sigma$ is the input alphabet together with a
blank symbol $B$, $q_0$ is the initial state, $q_f$ is the final state and $\delta : Q \times \Sigma \to (Q \times \Sigma \times \{L,R\})$
is the transition map, where $L$ stands for “left” and $R$ stands for “right”.

The formula $\phi$ that we associate with the machine $M$ is a formula of the form

$\exists\pi\,\psi(\pi)$,

where $\psi$ is a formula that does not contain any quantification over path variables. Given a
word $w$, the path $\pi$ in the graph $G_w$ will encode an accepting run of $M$ over the word $w$.

Given a word $w$ of size $n$, consider a configuration $C$ of the run of $M$ over $w$ where
the contents of the tape is the word $w' = w'_1 \dots w'_{f_0 n}$, the head is scanning the cell number
$i_0$ and the machine is in state $q$. Similarly to the proof of Theorem 4.1 we encode the
configuration $C$ by the word

$e_C = N d_1^C N d_2^C \dots N d_{f_0 n}^C$

where each $d_i^C$ encodes the information in cell number $i$ in the configuration $C$. We define
$d_i^C$ as the pair $(q'_i, w'_i)$, where $q'_i$ is defined as follows. The letter $w'_i$ is the contents of the
cell $i$. The letter $q'_i$ is equal to $\$$ if the head is not scanning the cell number $i$; otherwise, $q'_i$
is equal to the state $q$. That is, $q'_{i_0} = q$ and for all $i \neq i_0$, $q'_i = \$$.

The run of $M$ over the word $w$ is a (possibly infinite) sequence of configurations of the
form $C_0 C_1 \dots$. We encode the run as the word $e_{C_0} \# e_{C_1} \# \cdots$, where $\#$ plays the role of
a delimiter. We will define the formula $\psi(\pi)$ and the graph $G_w$ in such a way that a path
$\pi$ satisfies $\psi$ iff the label of $\pi$ is the encoding (as defined above) of an accepting run of $M$
over $w$.

We think of a path $\pi$ encoding a run of $M$ over $w$ as consisting of two parts. The label
of the first part contains the encoding $e_{C_0}$ of the initial configuration $C_0$. The label of the
second part contains the encoding $e_{C_1} \# e_{C_2} \# \cdots$ of the remaining part of the run. The first
part of the path $\pi$ is a path in a subgraph $I_w$ of $G_w$, while the second part is a path in the
subgraph $H$ (independent of $w$) of $G_w$. The graph $G_w$ will be obtained by adding an edge
from a node in $I_w$ to a node in $H$.

The graph Iw is given by 



In the graph above the data value $i$ carried by a node $v$ indicates that the label of the
outgoing edge of $v$ is $d_i^{C_0}$ (where $C_0$ is the initial configuration and $d_i^{C_0}$ is defined as above).
Recall that $d_i^{C_0}$ indicates the contents of the cell number $i$ and whether or not the head is
scanning that cell. There is a unique path $\pi_0$ from the node with no incoming edge to the
unique node with no outgoing edge. The label of $\pi_0$ is

$s\,N(q_0,w_0)\,N(\$,w_1)\,N(\$,w_2) \cdots N(\$,w_n)\,N(\$,B) \dots N(\$,B)$,

that is, the word $s.e_{C_0}$, where $e_{C_0}$ is the encoding of the initial configuration $C_0$.

We define now the graph $H$ encoding the remaining part of the run of the machine.

[Figure: the graph $H$.]

For all $1 \leq i \leq f_0 n$, all $q \in Q$ and all $a \in \Sigma$, the node with data value $i$ admits outgoing
edges with label $(q,a)$ and $(\$,a)$. A path from the “left-most” node to the “right-most”
node that does not go through the edge with label $\#$ has a label of the form

$N d_1 N d_2 \dots N d_{f_0 n} N$,

where each $d_i$ belongs to $(Q \cup \{\$\}) \times \Sigma$, that is, an encoding of a configuration of the
machine.

Hence, a path $\pi'$ from the “left-most” node to the “right-most” node of $H$ (possibly
going through the edge with label $\#$) has a label of the form

$e_{C_1} \# e_{C_2} \# \cdots e_{C_n}$

where each $e_{C_i}$ is the encoding of a configuration of the machine.

We are now ready to define the graph $G_w$

[Figure: the graph $G_w$.]

The edge with label $\#$ is an edge from the unique node in $I_w$ with no outgoing edge
to the “left-most” node in $H$. We define now the formula $\psi$. Let $A$ be the alphabet
$[(Q \cup \{\$\}) \times \Sigma] \cup \{\#, s\}$. The formula $\psi$ must be such that

$G_w \models \psi(\pi)$ iff $l(\pi)$ is of the form $s\,e_{C_0} \# e_{C_1} \cdots e_{C_p} A^*$,

where $l(\pi)$ is the label of $\pi$ and $C_0 \dots C_p$ is an accepting run of the machine over $w$. In fact,
it will be intuitively easier to first define a formula $\chi(\pi)$ such that

$G_w \models \chi(\pi)$ iff $l(\pi)$ is not of the form $s\,e_{C_0} \# e_{C_1} \cdots e_{C_p} A^*$, (5.2)

and define $\psi$ as $\neg\chi$. The formula $\chi$ is obtained as a disjunction of the following subformulas.

• First $l(\pi)$ might not satisfy $s\,e_{C_0} \# e_{C_1} \dots e_{C_p} A^*$ because: (i) it does not start with the
letter $s$ or (ii) it does not contain the encoding of a final configuration, i.e. it does not
contain any occurrence of a pair of the form $(q_f,a)$ for some $a \in \Sigma$. Case (i) is expressed
by the formula

$\bigvee \{ e_\gamma(\pi,\bar\bot,\bar\bot) : \gamma \in A, \gamma \neq s \}$,

where $e_\gamma$ is the REM $\gamma A^*$. Case (ii) is expressed by the formula

$((A \setminus \Delta_f)^*)(\pi,\bar\bot,\bar\bot)$,

where $\Delta_f$ is the set $\{q_f\} \times \Sigma$. This formula expresses that there is no pair of the form
$(q_f,a)$, for some $a \in \Sigma$.

• Next $l(\pi)$ might not be of the “right form” because it contains a substring of the form
$e_C \# e_{C'}$ occurring before a pair of the form $(q_f,a)$ and such that $C$ and $C'$ are not
consecutive configurations. This might happen for several reasons: (A) either we did not
modify properly the contents of a cell or move properly the head or (B) we modified the
contents that was supposed to remain constant.

We will only treat one case. As we treated case (A) in the proof of Theorem 5.2, here
we treat case (B). As in the proof of Theorem 5.2 we also consider a slightly different
definition of a run of a Turing machine (which is equivalent to the usual definition, but
helps us to keep our formulas simpler). We assume that if $\delta(q,a) = (q',b,R)$ and the
machine scans a cell $c$ with content $a$, then in the next state, the machine scans the
successor $c'$ of $c$, the content of $c'$ is $b$, while the content of $c$ is $a$.

In case (B), we make the following case distinction.

— Suppose first that case (B) happened because we modified the contents of the cell that
was scanned by the head in configuration $C$. Note that when moving from $C$
to $C'$, by definition of a run, we cannot modify the contents of the cell scanned in
configuration $C$.

Let $(q,a)$ be a pair in $Q \times \Sigma$ and let $\Delta$ be the set $(Q \cup \{\$\}) \times \Sigma$. We define also $A^*_\#$
as the set of words over $A$ that contain at most one occurrence of the symbol $\#$. We
let $e_{q,a}$ be the formula

$(A \setminus \Delta_f)^* \cdot {\downarrow}r.\,(q,a)\,A^*_\#[r^=]\,(\Delta \setminus \{(q',a) : q' \in Q \cup \{\$\}\})\,A^*$.

Before we reach a final state, we store in register $r$ the number of a cell that is scanned in
the current configuration and with contents $a$. In the next configuration (after reading
a word in $A^*_\#$), when we read the same cell, it does not contain $a$. That is, the label
of the edge is not a pair of the form $(q',a)$.

The formula

$\bigvee \{ \exists\nu\, e_{q,a}(\pi,\bar\bot,\nu) : (q,a) \in Q \times \Sigma \}$,

takes care of the cases where, when moving from one configuration to the next, we
modified the contents of the cell scanned in the first configuration.

— Next suppose that when moving from $C$ to $C'$, we modified the contents of a cell
that was not scanned in $C$. Suppose also that according to the machine, we were not
supposed to modify those contents.

Let $a$ be a letter in $\Sigma$. Let $\Delta_L$ be the set of pairs $(q,a)$ such that $\delta(q,a) = (q',b,L)$
and let $\Delta_R$ be the set of pairs $(q,a)$ such that $\delta(q,a) = (q',b,R)$. We define $e_a$ as
the REM

$(\Delta \setminus \Delta_f)^* \cdot (\Delta \setminus \Delta_R) \cdot {\downarrow}r.(\$,a) \cdot (\Delta \setminus \Delta_L) \cdot \Delta^*_{\le 1}[r^=] \cdot (\Delta \setminus \{(\$,a)\}) \cdot \Delta^*,$

where $\Delta^*_{\le 1}$ denotes the words over $\Delta$ with at most one occurrence of the symbol that
separates consecutive configurations.

Before we reach a final state, in a configuration $C$, we store in register $r$ the number $i$
of an unscanned cell with contents $a$. We assume that if the cell with number $(i - 1)$ (if
it exists) is scanned in $C$, then the head is not moving to the right (i.e., to cell number
$i$) in the next configuration. That is, the pair describing the cell number $(i - 1)$ in $C$
is not a pair in $\Delta_R$. We also assume that if cell number $(i + 1)$ (if it exists) is scanned,
then the head is not moving to cell number $i$ in the next configuration, i.e., cell number
$(i + 1)$ is not described by a pair in $\Delta_L$.

Next we express that in the next configuration, it is not the case that the cell with
number $i$ is an unscanned cell with contents $a$. That is, after reading a word in $\Delta^*_{\le 1}$,
when we see again the node stored in the register, the label of the edge is not the pair
$(\$,a)$.

The formula

$\bigvee \{\exists\nu \, e_a(\pi, \bot, \nu) : a \in \Sigma\}$

takes care of case (B) in the case where the modified cell is a cell unscanned by the
head.

This finishes the proof of the theorem. □

In the next section we introduce an interesting language, based on a restriction of RL, 
that is tractable in data complexity, and thus better suited for database applications. This 
language is a proper extension of REM. But before that, we make some important remarks 
about the expressive power of RL. 

Expressive power of RL. We now look at the expressive power of the logic RL. It was proven
in [8] that CRPQ is not subsumed by WL. Since RL subsumes CRPQ$^\neg$, it follows that RL is
not subsumed by WL. On the other hand, WL is also not subsumed by RL, due to the
nonelementary data complexity of WL, Theorem 5.2, and the standard time/space hierarchy
theorems from complexity theory. Therefore, we have the following proposition:

Proposition 5.4. WL and RL are incomparable in terms of expressive power.

On the other hand, we shall argue now that many natural queries about the interaction 
between data and topology are also expressible in RL. The aforementioned query (Q) is 
one such example. We shall now mention other examples: hamiltonicity (H), the existence 
of an eulerian trail (E), bipartiteness (B), and connected graphs with an even number of 
nodes (C2). The first two are expressible in WL, while (B) and (C2) are not known to be 
expressible in WL. We conjecture that they are not. 

We now show how to express in RL the existence of a hamiltonian path in a graph; the
query (E) can be expressed in the same way but with two registers (to remember edges, each
consisting of two nodes). This is done with the following formula over $\Sigma = \{a\}$ and a single
register $r$:

$\exists\pi \, \big( \forall\lambda\forall\lambda' \, \neg e_1(\pi,\lambda,\lambda') \;\wedge\; \forall\lambda \, (\lambda \neq \bot \rightarrow e_2(\pi,\lambda,\lambda)) \big),$

where $e_1 := a^* \cdot ({\downarrow}r.a^+[r^=]) \cdot a^*$ is the REM that checks whether in a path some node is
repeated (i.e., that it is not a simple path), and $e_2 := a^*[r^=]a^*$ is the REM that checks that
the node stored in register $r$ appears in a path. In fact, this query expresses that there is a
path $\pi$ that is simple (as expressed by the formula $\forall\lambda\forall\lambda' \, \neg e_1(\pi,\lambda,\lambda')$), and every node of
the graph database is mentioned in $\pi$ (as expressed by the formula $\forall\lambda \, (\lambda \neq \bot \rightarrow e_2(\pi,\lambda,\lambda))$).
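On small inputs, the two conjuncts of the hamiltonicity formula can be checked by brute force. The Python sketch below is only an illustration under assumptions that are not part of the paper: the graph is given as a list of nodes and a list of directed edge pairs, every node acts as its own data value, and the name `has_hamiltonian_path` is hypothetical. It extends a candidate path only while no node is repeated (the first conjunct) and accepts once every node occurs on the path (the second conjunct). It runs in exponential time and is not the evaluation procedure for RL.

```python
def has_hamiltonian_path(nodes, edges):
    """Brute-force search for a simple path that visits every node."""
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    def extend(path):
        # Second conjunct: every node of the graph occurs on the path.
        if set(path) == set(nodes):
            return True
        for w in succ[path[-1]]:
            # First conjunct: never repeat a node, so the path stays simple.
            if w not in path and extend(path + [w]):
                return True
        return False

    return any(extend([v]) for v in nodes)


if __name__ == "__main__":
    print(has_hamiltonian_path([1, 2, 3], [(1, 2), (2, 3)]))
    print(has_hamiltonian_path([1, 2, 3], [(1, 2), (1, 3)]))
```

The search is deliberately naive; it only mirrors the shape of the formula, one test per conjunct.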

We now show how to express in RL the bipartiteness property from graph theory. An
undirected graph $G = (V, E)$ is bipartite if its set of nodes can be partitioned into two
sets $S_1$ and $S_2$ such that, for each edge $(v,w) \in E$, either (i) $v \in S_1$ and $w \in S_2$, or (ii)
$v \in S_2$ and $w \in S_1$. It is well-known that a graph database is bipartite if and only if it
does not have cycles of odd length. The latter is expressible in RL since the existence of an
odd-length cycle can be expressed as $\exists\pi\exists\lambda\exists\lambda' \, e(\pi, \lambda, \lambda')$, where $e = {\downarrow}r.a(aa)^*[r^=]$.
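The odd-cycle test can be mimicked operationally: fix a start node (the value stored in register $r$), traverse edges while tracking the parity of the number of edges read, and accept when the start node is reached again after an odd number of edges. The Python sketch below illustrates this under assumptions that are not in the paper: the edge-list encoding, the treatment of each undirected edge as two directed edges, and the function names are all hypothetical. It searches for closed walks rather than simple cycles, which is enough because an odd closed walk exists exactly when an odd cycle exists.

```python
from collections import deque


def has_odd_closed_walk(nodes, edges):
    """True if some node can return to itself after an odd number of edges."""
    succ = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
    for start in nodes:  # the node whose value is stored in register r
        seen = {(start, 0)}
        queue = deque([(start, 0)])
        while queue:
            node, parity = queue.popleft()
            for nxt in succ[node]:
                state = (nxt, 1 - parity)
                if state == (start, 1):
                    # Back at the stored node after an odd number of edges.
                    return True
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False


def is_bipartite(nodes, undirected_edges):
    """Bipartite iff there is no odd closed walk in the symmetric closure."""
    both = list(undirected_edges) + [(v, u) for u, v in undirected_edges]
    return not has_odd_closed_walk(nodes, both)


if __name__ == "__main__":
    print(is_bipartite([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))
    print(is_bipartite([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)]))
```

The search state is a pair (node, parity), so each start node is processed in time linear in the size of the graph.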

We now show how to express in RL that a graph database is a connected graph with
an odd number of nodes. To this end, it is sufficient and necessary to express the existence
of a hamiltonian path $\pi$ with an even number of edges in the graph. But this is a simple
modification of our formula for expressing hamiltonicity: we add the check that $\pi$ has an
even number of edges by adding the conjunct $e(\pi, \nu, \nu')$, where $e = (aa)^*$, and close the
entire formula under existential quantification of $\nu$ and $\nu'$.

5.3. Tractability in data complexity. Let RL$^+$ be the positive fragment of RL, i.e.,
the logic obtained from RL by forbidding negation but adding conjunctions (as they were
not explicitly present in RL). It is easy to prove that the data complexity of the evaluation
problem for RL$^+$ is tractable (NLogspace). This fragment contains the class of conjunctive
REMs, which has been previously identified as tractable in data complexity [10]. However,
the expressive power of RL$^+$ is limited, as the following proposition shows.

Proposition 5.5. The query (Q) from the introduction is not expressible in RL$^+$.

Proof. Recall that Q is the following query: Find pairs of nodes $x$ and $y$ such that there
is a node $z$ and a path $\pi$ from $x$ to $y$ in which each node is connected to $z$. Suppose for
contradiction that there is a formula $\varphi$ in RL$^+$ over an alphabet $\Sigma$ and registers $r_1, \ldots, r_k$,
expressing $\exists x \exists y \, Q$. We may assume that $\varphi$ is of the form

$\exists x_1 \ldots \exists x_{n_1} \, \exists \pi_1 \ldots \exists \pi_{n_2} \, \exists \nu_1 \ldots \exists \nu_{n_3} \, \psi,$

where $\psi$ is a disjunction of conjunctions of atoms. Let $G = (V, E, \kappa)$ be the following graph



where each edge is labeled with $a$. The query $\exists x \exists y \, Q$ is true in $G$; hence, the formula $\varphi$ must
be true in $G$. That is, there is an assignment $\alpha$ mapping each variable $x_j$ to a node in $G$,
each path variable $\pi_j$ to a path $\rho_j$ in $G$ and each variable $\nu_j$ to a tuple in $\{\bot, \$, 1, \ldots, n_2 + 1\}^k$
such that

$(G, \alpha) \models \psi.$

Let $G'$ be the graph $(V, E', \kappa)$ where $E'$ is the set

$\{(i, a, i + 1) : 1 \le i \le n_2\} \cup \{(i, a, \$) : \text{for some } 1 \le j \le n_2, \ (i, a, \$) \text{ occurs in } \rho_j\}.$

That is, we delete the edges $(i, a, \$)$ that $\alpha$ "does not use". By definition of $E'$, the formula
$\psi$ remains true in $G'$ under the assignment $\alpha$. In particular, $\varphi$ is true in $G'$. This implies
that $\exists x \exists y \, Q$ holds in $G'$.

Now, for all $1 \le j \le n_2$, there is at most one natural number $i$ such that $(i, a, \$)$ occurs
in $\rho_j$. This is simply because there is no path going through edges $(i, a, \$)$ and $(i', a, \$)$ if
$i \neq i'$. This implies that the set

$\{(i, a, \$) : \text{for some } 1 \le j \le n_2, \ (i, a, \$) \text{ occurs in } \rho_j\}$

contains at most $n_2$ edges. Since $G$ admits $n_2 + 1$ edges of the form $(i, a, \$)$, there must be
an edge $(i_0, a, \$)$ occurring in $G$, but not in $G'$. This means that $G'$ is a graph of the form



In particular, $\exists x \exists y \, Q$ is not true in $G'$, which contradicts the fact that $\varphi$ is true in $G'$. □

On the other hand, increasing the expressive power of RL$^+$ with some simple forms of
negation leads to intractability of query evaluation in data complexity:

Proposition 5.6. There is a finite alphabet $\Sigma$ and REMs $e_1, e_2, e_3, e_4$ over $\Sigma$ and a single
register $r$, such that Eval(RL, $\varphi$) is PSPACE-complete, where $\varphi$ is either
$\exists\pi\exists\lambda \, \neg(e_1(\pi, \bot, \lambda) \vee e_2(\pi, \bot, \bot))$ or $\exists\pi\forall\lambda \, \neg(e_3(\pi, \bot, \lambda) \vee e_4(\pi, \bot, \bot))$.

Proof. By the proof of Theorem 5.3 (and using the same notation), we know that for every
Turing machine $M$, there is a formula $\chi(\pi)$ such that for all words $w$ of size $n$, there is a
graph $G_w$ (of size polynomial in $n$) such that

$G_w \models \exists\pi \, \neg\chi(\pi)$ iff there is an accepting run of $M$ over $w$ using at most $cn$ cells.

Moreover, the formula $\chi(\pi)$ is a formula of the form

$\bigvee \{e_i(\pi, \bot, \bot) : i \in I\} \;\vee\; \bigvee \{\exists\lambda \, f_j(\pi, \bot, \lambda) : j \in J\},$

where $\{e_i, f_j : i \in I, j \in J\}$ is a set of REMs that do not contain any $\cup$. Since REMs are
closed under union, $\chi(\pi)$ is equivalent to a formula of the form

$\exists\lambda \, (e(\pi, \bot, \bot) \vee f(\pi, \bot, \lambda)).$

Hence, the formula $\exists\pi \, \neg\chi(\pi)$ is equivalent to

$\exists\pi\forall\lambda \, \neg(e(\pi, \bot, \bot) \vee f(\pi, \bot, \lambda)).$

This proves that Eval(RL, $\varphi$) is PSPACE-complete, where $\varphi = \exists\pi\forall\lambda \, \neg(e(\pi, \bot, \bot) \vee f(\pi, \bot, \lambda))$.

Now we prove that Eval(RL, $\varphi'$) is PSPACE-complete, where $\varphi'$ is a formula of the form
$\exists\pi\exists\lambda \, \neg(e'(\pi, \bot, \bot) \vee f'(\pi, \bot, \lambda))$, for some REMs $e'$ and $f'$. The intuition is as follows. The
difference between $\varphi$ and $\varphi'$ is that in $\varphi$, we may choose the data value that is in the register
after checking that $f$ is true. However, in $\varphi'$, we must be able to store any value in the
register after checking that $f'$ is true. We will make two changes to make this possible.

First, we modify the graph $G_w$ in such a way that any two nodes are always reachable
from each other. This can be easily achieved by adding an edge from the "right-most node" of the
graph $H$ to the "left-most node" of the graph $I_w$ (which encodes the initial configuration
of a run). Second, we modify the REMs of $\varphi$ in such a way that the label of a path satisfying
those REMs encodes an accepting run and, after reaching the final state, goes through
all the nodes of $G_w$. Hence, once we have checked that the run reaches the final state, we can
simply store any value in the register. We leave out the details, as the intuition is
simple and the details are a bit tedious.

□ 

In the case of basic navigational languages for graph databases, it is possible to increase
the expressive power, without affecting the cost of evaluation, by extending formulas with
a branching operator (in the style of the class of nested regular expressions [5]). The same
idea can be applied in our scenario, by extending atomic REM formulas in RL$^+$ with such a
branching operator. The resulting language is more expressive than RL$^+$ (in particular, this
extension can express query (Q)), yet remains tractable in data complexity. We formalize
this idea below.

The class of nested REMs (NREM) extends REM with a nesting operator $\langle \cdot \rangle$ defined
as follows: If $e$ is an NREM then $\langle e \rangle$ is also an NREM. Intuitively, the formula $\langle e \rangle$ filters
those nodes in a data graph that are the origin of a path that can be parsed according to
$e$. Formally, if $e$ is an NREM over $k$ registers and $G$ is a data graph, then $[\![\langle e \rangle]\!]_G$ consists
of all tuples of the form $(u, \lambda, \rho = u, u, \lambda)$ such that $(u, \lambda, \rho', v, \lambda') \in [\![e]\!]_G$, for some node $v$
in $G$, path $\rho'$ in $G$, and $k$-tuple $\lambda'$ over $D_\bot$.

Let NRL$^+$ be the logic that is obtained from RL$^+$ by allowing atomic formulas of the
form $e(\pi, \lambda, \lambda')$, for $e$ an NREM. Given a data graph $G$ and an assignment $\alpha$ for $\pi$, $\lambda$ and
$\lambda'$ over $G$, we write as before $(G, \alpha) \models e(\pi, \lambda, \lambda')$ if and only if $\alpha(\pi)$ goes from $u$ to $v$ and
$(u, \alpha(\lambda), \alpha(\pi), v, \alpha(\lambda')) \in [\![e]\!]_G$. The semantics of NRL$^+$ is thus obtained from the semantics
of these atomic formulas in the expected way. The following example shows that query (Q)
is expressible in NRL$^+$, and, therefore, that NRL$^+$ increases the expressiveness of RL$^+$.

Example 5.3. Over graph databases, the query (Q) from the introduction is expressible
in NRL$^+$ using the following formula over $\Sigma = \{a\}$ and register $r$:

$\varphi = \exists\pi\exists\lambda \, ((x, \pi, y) \wedge e(\pi, \lambda, \lambda)),$

where $e := (\langle e_1 \rangle \cdot a)^* \langle e_1 \rangle$, for $e_1 = a^*[r^=]$. Intuitively, $e_1$ checks in a path whether its last
node is precisely the node stored in register $r$, and thus $e$ checks whether every node in a
path can reach the node stored in register $r$. Therefore, the formula $\varphi$ defines the set of
pairs $(x, y)$ of nodes, such that there is a path $\pi$ that goes from $x$ to $y$ and a register value
$\lambda$ (i.e., a node $\lambda$) that satisfies that every node in $\pi$ is connected to $\lambda$. □
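The way the nested expression in Example 5.3 acts as a node filter can be made concrete as follows. The Python sketch below is an illustration only, under assumptions that are not in the paper: single-label edges are given as pairs, each node is its own data value, and the names `nodes_reaching` and `query_q` are hypothetical. For each candidate register value $z$, it first computes the nodes that pass the filter (those with a possibly empty path to $z$), and then searches for paths from $x$ to $y$ that stay inside the filtered set.

```python
def nodes_reaching(target, nodes, edges):
    """Nodes with a (possibly empty) path to `target`: the filter for r = target."""
    pred = {v: set() for v in nodes}
    for u, v in edges:
        pred[v].add(u)
    seen, stack = {target}, [target]
    while stack:
        v = stack.pop()
        for u in pred[v]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def query_q(nodes, edges):
    """Pairs (x, y) joined by a path all of whose nodes can reach a common node z."""
    succ = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
    answers = set()
    for z in nodes:  # candidate value for register r
        good = nodes_reaching(z, nodes, edges)
        for x in good:
            # Follow edges from x, visiting only nodes that pass the filter.
            seen, stack = {x}, [x]
            while stack:
                v = stack.pop()
                for w in succ[v]:
                    if w in good and w not in seen:
                        seen.add(w)
                        stack.append(w)
            answers.update((x, y) for y in seen)
    return answers


if __name__ == "__main__":
    g_nodes = [1, 2, 3, "z"]
    g_edges = [(1, 2), (2, 3), (1, "z"), (2, "z"), (3, "z")]
    print(sorted(query_q(g_nodes, g_edges), key=str))
```

Computing the filter set once per register value, and then reusing it during the path search, is the same separation of concerns that the evaluation algorithm for nested expressions relies on.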

The extra expressive power of NRL$^+$ over RL$^+$ does not affect the data complexity of
query evaluation:

Theorem 5.7. Evaluation of NRL$^+$ formulas is in NLogspace in terms of data complexity.

Proof. Let $G = (V, E, \kappa)$ be a data graph and $\varphi$ an NRL$^+$ formula. Also, let
$D = \{\kappa(u) \mid u \in V\}$. We assume without loss of generality that $\varphi$ is Boolean, that is, we study the
complexity of deciding whether $G \models \varphi$. In the case when $\varphi$ is not Boolean, that is, when
the input consists of $G$ and an assignment $\alpha$ for $\varphi$ over $G$, we simply replace each free
variable $\eta$ in $\varphi$ by $\alpha(\eta)$, and then use the evaluation algorithm we describe below for the
resulting formula.

Assume without loss of generality that $\varphi$ is of the form $\exists\bar{x}\exists\bar{\nu}\exists\bar{\pi} \, \psi$, where $\bar{x}$ is a tuple of
node variables, $\bar{\nu}$ is a tuple of register assignment variables, $\bar{\pi}$ is a tuple of path variables, and
$\psi$ is quantifier-free. Assume also that $\{e_1, \ldots, e_m\}$ is the set of NREMs mentioned in $\varphi$, and
that each such NREM is over $\{r_1, \ldots, r_k\}$. The evaluation algorithm does the following: It
first guesses witnesses for the existentially quantified node and register assignment variables.
Assume that the guess for $\bar{x}$ is $\bar{v}$, where $\bar{v}$ is a tuple of nodes of the same arity as $\bar{x}$, and
that the guess for $\bar{\nu}$ is $\bar{\lambda}$, where $\bar{\lambda}$ is a tuple of elements in $(D \cup \{\bot\})^k$ of the same arity
as $\bar{\nu}$. Clearly, both $\bar{v}$ and $\bar{\lambda}$ can be represented using only logarithmic space (since $\varphi$, and
therefore $k$, is fixed).

The algorithm then guesses a witness $\xi$ for each existentially quantified path variable $\pi$
in $\bar{\pi}$. This witness codifies all the information that we need to know about the actual path $\rho$
that interprets $\pi$. In our case, $\xi$ is a string that lists (in a precise order) the endpoints $u$ and
$v$ of $\rho$ and the tuples $(e, \lambda, \lambda')$, for $e \in \{e_1, \ldots, e_m\}$ and $\lambda, \lambda' \in \bar{\lambda}$, such that $(u, \lambda, \rho, v, \lambda') \in
[\![e]\!]_G$. Let $\bar{\xi}$ be the witnesses for the tuple $\bar{\pi}$. It is not hard to see that $\bar{\xi}$ can be represented
using logarithmic space.

Finally, the algorithm checks that the guess satisfies $\psi$. Since $\psi$ is a Boolean
combination of atomic formulas, we only have to explain how to do this in logarithmic space
for each atomic formula of the logic. Atomic formulas of the form $x = y$, $\nu = \nu'$ and $\nu = \bot$
are self-evident. Atomic formulas of the form $\pi = \pi'$ only require checking whether the
witness of $\pi$ is equal to the witness of $\pi'$. Atomic formulas of the form $(x, \pi, y)$ require
checking in the witness $\xi$ of $\pi$ whether its endpoints correspond to the witnesses of $x$ and $y$.
Finally, formulas of the form $e(\pi, \nu, \nu')$ require checking in $\xi$ whether $(u, \lambda, \rho, v, \lambda') \in [\![e]\!]_G$,
where $\lambda$ and $\lambda'$ are the witnesses of $\nu$ and $\nu'$, respectively. Clearly, any of this can be done
in logarithmic space.

The only thing that remains to be done is checking that $\bar{\xi}$ is consistent with $G$. That
is, we need to check the following for each path variable $\pi$ that is mentioned in $\bar{\pi}$ whose
witness in $\bar{\xi}$ is $\xi$: There is a path $\rho$ in $G$ whose codification corresponds to $\xi$. In other words,
if $\xi$ tells us that the endpoints of the path are $u$ and $v$, we need to check in $G$ if there is
a path $\rho$ from $u$ to $v$ such that, for each $e \in \{e_1, \ldots, e_m\}$ and $\lambda, \lambda' \in \bar{\lambda}$, it is the case that
$(u, \lambda, \rho, v, \lambda') \in [\![e]\!]_G$ each time that $\xi$ tells us so. We explain next how this can be done in
NLogspace combining techniques from REM and nested regular expressions evaluation.

The algorithm starts computing, for each expression of the form $\langle e \rangle$ that appears in any
of the $e_j$'s, the set $U(e)$ of pairs of the form $(w, \lambda)$, for $w$ a node in $G$ and $\lambda$ a $k$-tuple over
$D \cup \{\bot\}$, such that $(w, \lambda, \rho' = w, w, \lambda) \in [\![\langle e \rangle]\!]_G$. In order to do so it proceeds recursively
depending on the nesting depth of the expression $\langle e \rangle$, which is 1 if $e$ contains no nested
subexpression, and it is 1 more than the maximum nesting depth of any subexpression of
$e$ of the form $\langle e' \rangle$ otherwise. The algorithm starts with those expressions $\langle e \rangle$ of nesting
depth one, i.e., when $e$ is an REM (no nesting). Using techniques from [10] it is possible to
compute in NLogspace the set $U(e)$, for each such expression $\langle e \rangle$. Then it continues with
expressions of nesting depth two. In such case, it uses the same aforementioned techniques,
but each time the procedure is asked to check whether $(w, \lambda, \rho' = w, w, \lambda) \in [\![\langle e' \rangle]\!]_G$, for
a subexpression $\langle e' \rangle$ of nesting depth one, it simply checks whether $(w, \lambda) \in U(e')$. The
process continues iteratively in this way until all sets $U(e)$, for $\langle e \rangle$ a subexpression of any
of the $e_j$'s, are computed. Clearly, this iterative process can be performed in NLogspace.
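The order in which the sets for nested subexpressions are computed depends only on nesting depth, and that part is easy to make explicit. The Python sketch below is a hypothetical illustration and not the algorithm of the paper: it assumes expressions are encoded as nested tuples such as `("sym", "a")`, `("cat", e1, e2)`, `("star", e)` and `("nest", e)`, ignores register operations entirely, and only computes nesting depths and a bottom-up processing order.

```python
def nest_children(e):
    """Yield the outermost ("nest", ...) subexpressions strictly inside e."""
    for sub in e[1:]:
        if isinstance(sub, tuple):
            if sub[0] == "nest":
                yield sub
            else:
                yield from nest_children(sub)


def nesting_depth(nest):
    """1 if the body has no nested subexpression, else 1 + the maximum inner depth."""
    assert nest[0] == "nest"
    return 1 + max((nesting_depth(n) for n in nest_children(nest)), default=0)


def all_nests(e):
    """Yield every ("nest", ...) subexpression of e, including e itself."""
    if e[0] == "nest":
        yield e
    for sub in e[1:]:
        if isinstance(sub, tuple):
            yield from all_nests(sub)


def evaluation_order(expressions):
    """Distinct nested subexpressions, sorted so that lower depths come first."""
    nests = {n for e in expressions for n in all_nests(e)}
    return sorted(nests, key=nesting_depth)


if __name__ == "__main__":
    e1 = ("nest", ("star", ("sym", "a")))
    body = ("cat", ("star", ("cat", e1, ("sym", "a"))), e1)
    outer = ("nest", body)
    for n in evaluation_order([outer]):
        print(nesting_depth(n))
```

Processing expressions in this order guarantees that, when an expression of depth $d$ is handled, the sets for all of its nested subexpressions of depth less than $d$ are already available for lookup.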

Once the previous step is finished, the algorithm checks whether there is a path $\rho$ from $u$
to $v$ such that, for each $e \in \{e_1, \ldots, e_m\}$ and $\lambda, \lambda' \in \bar{\lambda}$, it is the case that $(u, \lambda, \rho, v, \lambda') \in [\![e]\!]_G$
each time that $\xi$ tells us so. This can be done in NLogspace applying the same techniques
used in the previous paragraph and the knowledge provided by the sets $U(e)$. The result
follows from the fact that NLogspace functions are closed under composition. □

From the proof of Theorem 5.7 it also follows that NRL$^+$ formulas can be evaluated in
PSPACE in combined complexity.

6. Conclusions and Future Work

We have proven that the data complexity of walk logic (WL) is nonelementary, which
rules out the practicality of the logic. We have proposed register logic (RL), which is an
extension of regular expressions with memory. Our results in this paper suggest that register
logic is capable of expressing natural queries about interactions between data and topology
in data graphs, while still preserving the elementary data complexity of query evaluation
(PSPACE). Finally, we showed how to make register logic tractable in data complexity
(NLogspace) through the logic NRL$^+$, while at the same time preserving some level of
expressiveness of RL.

We leave open several problems for future work. One interesting question is to study
the expressive power of extensions of walk logic, in comparison to RL and ECRPQ$^\neg$ from [4].
For example, we can consider extensions with regularity tests (i.e., an atomic formula testing
whether a path belongs to a regular language). Even in this simple case, the expressive power
of the resulting logic, compared to RL and ECRPQ$^\neg$, is already not obvious. Secondly, we do
not know whether NRL$^+$ is strictly more expressive than RL. Finally, we mention
that expressibility of bipartiteness in WL is still open (an open question from [8]). We also
leave open whether the query that a graph database is a connected graph with an even
number of nodes is expressible in WL.